<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rishab Dugar</title>
    <description>The latest articles on DEV Community by Rishab Dugar (@rishabdugar).</description>
    <link>https://dev.to/rishabdugar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F468779%2F9b9944ee-7997-49df-8d68-a778cff71d51.png</url>
      <title>DEV Community: Rishab Dugar</title>
      <link>https://dev.to/rishabdugar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rishabdugar"/>
    <language>en</language>
    <item>
      <title>Fine-Tune SLMs in Colab for Free: A 4-Bit Approach with Meta Llama 3.2</title>
      <dc:creator>Rishab Dugar</dc:creator>
      <pubDate>Sun, 11 May 2025 02:39:52 +0000</pubDate>
      <link>https://dev.to/rishabdugar/fine-tune-slms-in-colab-for-free-a-4-bit-approach-with-meta-llama-32-495o</link>
      <guid>https://dev.to/rishabdugar/fine-tune-slms-in-colab-for-free-a-4-bit-approach-with-meta-llama-32-495o</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcsajrwfx8ouo24ifpzcs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcsajrwfx8ouo24ifpzcs.png" alt="UnSloth Guides the Llama’s Fine-Tuning Ritual — a whimsical stylized llama illustration"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fine-tuning large language models (LLMs) sounds complex — until you meet Unsloth.&lt;/strong&gt; Whether you’re a complete beginner or an experienced ML tinkerer, this guide walks you through the simplest and most efficient way to fine-tune LLaMA models on free GPUs using Google Colab. Best of all? No fancy hardware or deep ML theory required.&lt;/p&gt;

&lt;p&gt;This article breaks down every keyword, library, and function, defining each term precisely but in the simplest language possible.&lt;/p&gt;

&lt;p&gt;In this article, you’ll learn how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install and configure &lt;a href="https://unsloth.ai/" rel="noopener noreferrer"&gt;Unsloth&lt;/a&gt; in Colab&lt;/li&gt;
&lt;li&gt;Load models in quantized (4-bit) mode to save memory&lt;/li&gt;
&lt;li&gt;Understand core concepts (parameters, weights, biases, quantization, etc.)&lt;/li&gt;
&lt;li&gt;Apply PEFT and LoRA adapters to fine-tune only a small part of the model&lt;/li&gt;
&lt;li&gt;Prepare Q&amp;amp;A data for training with Hugging Face Datasets and chat templates&lt;/li&gt;
&lt;li&gt;Use SFTTrainer for supervised fine-tuning&lt;/li&gt;
&lt;li&gt;Switch to inference mode for faster generation&lt;/li&gt;
&lt;li&gt;Save and reload your fine-tuned model&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Getting Comfortable With Some Core Concepts
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3072%2F0%2AkYVR3gVPzsXtHQOn" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3072%2F0%2AkYVR3gVPzsXtHQOn" alt="Llama’s Machine Learning Meditation — stylized llama in lotus pose"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer:&lt;/strong&gt; I promise this will be the friendliest GenAI glossary — your cheat sheet, wittier than autocorrect and way less judgmental! 😉&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Language Model —&lt;/strong&gt; &lt;em&gt;A word-predictor, to put it simply — like a smart autocomplete that predicts the next word based on what came before, much like your phone suggests “you” after “how are”. It learns these patterns by “reading” massive amounts of text, so it knows common word sequences and can fill in blanks. Behind the scenes, it models the probabilities of word sequences, assigning higher scores to more natural continuations.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attention —&lt;/strong&gt; &lt;em&gt;Imagine you’re reading a sentence and want to know which earlier words matter most to understand each new word — that’s attention. Instead of reading a sentence strictly left-to-right, attention lets every word weigh how much it should consider all the others, like skimming a page and highlighting only key phrases. This selective focus makes predictions more accurate and efficient, ignoring irrelevant details.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parameter —&lt;/strong&gt; &lt;em&gt;A number inside a model that can change during learning (like a dial that the model tweaks)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weight —&lt;/strong&gt; &lt;em&gt;Mostly synonymous with “parameter”; a weight controls how strongly one part of the input affects the output&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Data vs Parameters vs Weights —&lt;/strong&gt; &lt;em&gt;“Data” is the information used to train a model, “parameters” are the values the model learns from that data, and “weights” are a specific type of parameter representing the strength of connections between model variables. More simply: data is the input, parameters are what the model adjusts to make predictions, and weights are a subset of those parameters.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Bias —&lt;/strong&gt; &lt;em&gt;A small extra number added so the model can shift outputs up or down, like a baseline adjustment&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transformer —&lt;/strong&gt; &lt;em&gt;A &lt;strong&gt;Transformer&lt;/strong&gt; is a special model built around attention, letting it “look” at every word in a sentence in parallel rather than one by one. It’s like a study group where everyone reads the entire essay at once and then discusses which sentences are most important to the main idea. Introduced in 2017 in Google’s “&lt;a href="https://en.wikipedia.org/wiki/Attention_Is_All_You_Need" rel="noopener noreferrer"&gt;Attention Is All You Need&lt;/a&gt;” paper, Transformers are the backbone of today’s LLMs and power everything from translation tools to chatbots.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quantization —&lt;/strong&gt; &lt;em&gt;Reducing precision of weights (e.g. 16-bit → 4-bit) to slash memory use, with minimal accuracy loss.&lt;/em&gt;&lt;/p&gt;
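
&lt;p&gt;To make that concrete, here is a tiny illustrative sketch (a naive round-to-nearest scheme, not the smarter NF4 algorithm that bitsandbytes actually uses) mapping weights onto 16 levels and back:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

w = torch.randn(5)                            # a handful of example weights
scale = w.abs().max() / 7                     # fit the range into 16 signed levels (-8..7)
q = torch.clamp((w / scale).round(), -8, 7)   # the 4-bit integer codes
w_hat = q * scale                             # dequantized approximation
print(w)
print(w_hat)
print((w - w_hat).abs().max())                # small rounding error, big memory win
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;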

&lt;p&gt;&lt;strong&gt;PEFT —&lt;/strong&gt; &lt;em&gt;(&lt;a href="https://arxiv.org/abs/2312.12148" rel="noopener noreferrer"&gt;Parameter-Efficient Fine-Tuning&lt;/a&gt;) — updating only tiny adapter layers instead of the whole model.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LoRA —&lt;/strong&gt; &lt;em&gt;(&lt;a href="https://www.youtube.com/watch?v=DhRoTONcyZE&amp;amp;pp=0gcJCdgAo7VqN5tD" rel="noopener noreferrer"&gt;Low-Rank Adaptation&lt;/a&gt;) — A smart shortcut for teaching a huge AI model new tricks by only tweaking a tiny part of it instead of retraining the entire thing. You “freeze” most of the model’s parameters and insert two small, trainable matrices into each layer. During fine-tuning, only these add-ons learn, drastically cutting down on time and compute cost.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LoRA “r” —&lt;/strong&gt; &lt;em&gt;The adapter’s rank (size). Higher r gives more capacity but uses more memory.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LoRA α (alpha) —&lt;/strong&gt; &lt;em&gt;A scaling factor for adapter updates — like a “volume knob” for learning strength.&lt;/em&gt;&lt;/p&gt;
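
&lt;p&gt;The arithmetic behind LoRA, r, and α is easy to see in plain PyTorch. Below is a minimal sketch of the idea (an illustration only, not Unsloth’s internal implementation): the pretrained weight stays frozen, and only two small matrices, scaled by α/r, would be trained:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

d, r, alpha = 1024, 16, 16
W = torch.randn(d, d)                       # frozen pretrained weight (never updated)
A = torch.randn(r, d) * 0.01                # trainable, initialized small
B = torch.zeros(d, r)                       # trainable, starts at zero, so the update begins as 0
delta = (alpha / r) * (B @ A)               # low-rank update, scaled by alpha / r
W_eff = W + delta                           # effective weight the layer actually applies
print(W.numel(), A.numel() + B.numel())     # 1048576 frozen vs 32768 trainable (~3%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;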

&lt;p&gt;&lt;strong&gt;Dropout —&lt;/strong&gt; &lt;em&gt;Randomly turning off some adapter connections during training to prevent overfitting (can be set to 0).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gradient Checkpointing —&lt;/strong&gt; &lt;em&gt;Recomputes parts of the model during backpropagation to halve peak VRAM usage, at a slight speed cost.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4-bit Mode —&lt;/strong&gt; &lt;em&gt;Quantized mode storing weights in 4 bits, cutting memory by ~4× versus 16-bit (and ~8× versus 32-bit).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inference Mode —&lt;/strong&gt; &lt;em&gt;After training, use a special mode optimized for fast text generation (2× speed).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Overfitting —&lt;/strong&gt; &lt;em&gt;When a model “memorizes” a tiny dataset and fails on new inputs — always test on unseen data.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Checkpoint —&lt;/strong&gt; &lt;em&gt;A saved snapshot of model weights you can reload later.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token —&lt;/strong&gt; &lt;em&gt;A token is a small chunk of text (~4–5 characters) — a word, part of a word, punctuation mark, or symbol — that serves as the basic unit a model processes.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tokenizer —&lt;/strong&gt; &lt;em&gt;The tokenizer is the program that “cuts” raw text into those tokens and then converts each token into a unique number (ID) the model can work with (e.g., “unhappiness” → “un”, “happi”, “ness” → “un” = 137, “happi” = 428, etc).&lt;/em&gt;&lt;/p&gt;
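
&lt;p&gt;You can watch a tokenizer at work in a couple of lines. A quick sketch (the exact splits and IDs depend on each model’s vocabulary, so your output will differ):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from transformers import AutoTokenizer

# Any public model works; bert-base-uncased is a small, ungated example
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tok.tokenize("unhappiness")        # subword pieces, e.g. ['un', '##happiness'] or finer splits
ids = tok.convert_tokens_to_ids(tokens)     # the numeric IDs the model actually sees
print(tokens, ids)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;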

&lt;h2&gt;
  
  
  SLMs vs. LLMs
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3072%2F0%2AP3aTl3w0T8twLqCh" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3072%2F0%2AP3aTl3w0T8twLqCh" alt="LMs Ultimate Showdown — stylized vs. large model illustration"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SLMs (Small Language Models)&lt;/strong&gt; have fewer parameters and focus on specific tasks or domains — think of them as pocket calculators solving one type of math problem quickly and efficiently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLMs (Large Language Models)&lt;/strong&gt; are like supercomputers trained on vast, diverse data; they can tackle many tasks — writing essays, summarizing articles, or coding — because they’ve “read” almost the entire internet.&lt;/li&gt;
&lt;li&gt;SLMs require less compute power and are ideal for on-device or specialized applications, whereas LLMs need massive cloud resources but offer broader versatility.&lt;/li&gt;
&lt;li&gt;In practice, you might use an SLM for customer-service chat on your phone, but call an LLM when you need deep research help or creative story generation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  1. Getting Started: Colab Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why Google Colab &amp;amp; Tesla T4?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; Free GPU access, no paid hardware needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; Tesla T4 handles mid-size LLMs effectively when paired with quantization and PEFT&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accessibility:&lt;/strong&gt; No local GPU required — ideal for beginners&lt;/li&gt;
&lt;/ul&gt;
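
&lt;p&gt;Before installing anything, confirm a GPU runtime is attached (Runtime → Change runtime type → T4 GPU). A quick check:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

print(torch.cuda.is_available())            # True once a GPU runtime is attached
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))    # e.g. "Tesla T4" on the free tier
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;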

&lt;p&gt;&lt;strong&gt;Installing Unsloth&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Stable release from PyPI:&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;unsloth

&lt;span class="c"&gt;# OR&lt;/span&gt;

&lt;span class="c"&gt;# Install the Nightly (latest GitHub) for cutting-edge features:&lt;/span&gt;
pip uninstall unsloth &lt;span class="nt"&gt;-y&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; &lt;span class="nt"&gt;--no-deps&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  git+https://github.com/unslothai/unsloth.git@nightly &lt;span class="se"&gt;\&lt;/span&gt;
  git+https://github.com/unslothai/unsloth-zoo.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pip install unsloth&lt;/code&gt;: grabs the vetted, stable version&lt;/li&gt;
&lt;li&gt;uninstall &amp;amp; install: fetches the newest commits from GitHub (may include experimental updates)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  2. Loading a Model Efficiently
&lt;/h2&gt;

&lt;p&gt;We load the Llama 3.2 1B model with Unsloth in memory-efficient &lt;strong&gt;4-bit quantization mode&lt;/strong&gt;, which uses roughly one-quarter the memory of 16-bit precision, so the model runs faster and fits on small GPUs. The same call also sets the maximum input length (up to 2048 tokens).&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;unsloth&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastLanguageModel&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="c1"&gt;# Configuration
&lt;/span&gt;&lt;span class="n"&gt;max_seq_length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2048&lt;/span&gt;      &lt;span class="c1"&gt;# How many tokens each input can have
&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;               &lt;span class="c1"&gt;# None for auto detection. Float16 for Tesla T4, V100; Bfloat16 for Ampere+
&lt;/span&gt;&lt;span class="n"&gt;load_in_4bit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;        &lt;span class="c1"&gt;# Use 4-bit quantization to reduce memory usage
&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unsloth/Llama-3.2-1B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Load both model and tokenizer
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FastLanguageModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_seq_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_seq_length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;load_in_4bit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;load_in_4bit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;FastLanguageModel.from_pretrained&lt;/code&gt;: downloads and prepares the model + tokenizer&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_seq_length&lt;/code&gt;: sets the maximum context length&lt;/li&gt;
&lt;/ul&gt;
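
&lt;p&gt;Continuing from the cell above, you can verify the footprint yourself by counting parameters and checking allocated GPU memory (exact figures vary by model and library version):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e9:.2f}B parameters")
print(f"{torch.cuda.memory_allocated() / 1024**3:.2f} GiB allocated")  # well under the 16-bit size
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;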

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wly6zl109cuqgddl8rw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wly6zl109cuqgddl8rw.png" alt="Diagram showing memory usage and model loading output"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  3. Introducing PEFT &amp;amp; LoRA
&lt;/h2&gt;

&lt;p&gt;Instead of updating all model weights (which can be billions), &lt;strong&gt;PEFT&lt;/strong&gt; adds small adapter layers you train. LoRA is one such method.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FastLanguageModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_peft_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# Adapter rank (size). Suggested: 8, 16, 32, 64, 128
&lt;/span&gt;    &lt;span class="n"&gt;target_modules&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;o_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;up_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;down_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;lora_alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                  &lt;span class="c1"&gt;# Scales adapter updates
&lt;/span&gt;    &lt;span class="n"&gt;lora_dropout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                 &lt;span class="c1"&gt;# No dropout
&lt;/span&gt;    &lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                    &lt;span class="c1"&gt;# Skip bias updates
&lt;/span&gt;    &lt;span class="n"&gt;use_gradient_checkpointing&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unsloth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Unsloth-optimized checkpointing
&lt;/span&gt;    &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3407&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              &lt;span class="c1"&gt;# For reproducibility
&lt;/span&gt;    &lt;span class="n"&gt;use_rslora&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;               &lt;span class="c1"&gt;# Optional advanced LoRA variant
&lt;/span&gt;    &lt;span class="n"&gt;loftq_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              &lt;span class="c1"&gt;# Optional LoftQ config
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;r&lt;/strong&gt; controls the size of the LoRA layers; higher uses more memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;lora_alpha&lt;/strong&gt; is like a “volume knob” for learning strength.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;use_gradient_checkpointing="unsloth"&lt;/strong&gt; trades compute for lower VRAM.&lt;/li&gt;
&lt;/ul&gt;
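
&lt;p&gt;Continuing from the cell above, a quick way to see how little LoRA actually trains is to count trainable versus total parameters:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;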
&lt;h2&gt;
  
  
  4. Preparing Your Dataset for Training
&lt;/h2&gt;

&lt;p&gt;When preparing datasets for fine-tuning models like &lt;strong&gt;LLaMA 3.1&lt;/strong&gt; and &lt;strong&gt;Phi-4&lt;/strong&gt;, format multi-turn conversations according to each model’s expected structure.&lt;/p&gt;
&lt;h3&gt;
  
  
  🦙 LLaMA 3.1: Chat Template Format
&lt;/h3&gt;

&lt;p&gt;LLaMA 3.1 wraps each message with special tokens:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;|begin_of_text|&amp;gt;&amp;lt;|start_header_id|&amp;gt;system&amp;lt;|end_header_id|&amp;gt;
You are a helpful AI assistant.&amp;lt;|eot_id|&amp;gt;&amp;lt;|start_header_id|&amp;gt;user&amp;lt;|end_header_id|&amp;gt;
Hello!&amp;lt;|eot_id|&amp;gt;&amp;lt;|start_header_id|&amp;gt;assistant&amp;lt;|end_header_id|&amp;gt;
Hi there! How can I assist you today?&amp;lt;|eot_id|&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;|begin_of_text|&amp;gt;&lt;/code&gt; marks the start&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;|start_header_id|&amp;gt;&lt;/code&gt;/&lt;code&gt;&amp;lt;|end_header_id|&amp;gt;&lt;/code&gt; mark roles&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;|eot_id|&amp;gt;&lt;/code&gt; ends each message&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use Unsloth’s &lt;code&gt;standardize_sharegpt&lt;/code&gt; to convert existing data into this format.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phi-4&lt;/strong&gt; uses ChatML JSON:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You are a helpful assistant."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Can you explain machine learning?"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Certainly! Machine learning is..."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  🔄 Converting Between Formats
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Identify&lt;/strong&gt; current format (CSV, ShareGPT, ChatML).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Convert&lt;/strong&gt; to a ShareGPT-like structure (from/value).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standardize&lt;/strong&gt; to role/content with &lt;code&gt;standardize_sharegpt&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apply&lt;/strong&gt; chat template via &lt;code&gt;get_chat_template&lt;/code&gt; and &lt;code&gt;apply_chat_template&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Below, set &lt;code&gt;USE_CSV = True&lt;/code&gt; or &lt;code&gt;False&lt;/code&gt; to choose your data source.&lt;/p&gt;
&lt;h3&gt;
  
  
  Custom Dataset (CSV)
&lt;/h3&gt;

&lt;p&gt;We’ll use a fictional 30-question dataset about “Eastern Caverns” to illustrate fine-tuning on domain-specific Q&amp;amp;A.&lt;/p&gt;
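
&lt;p&gt;For reference, the CSV only needs &lt;code&gt;question&lt;/code&gt; and &lt;code&gt;answer&lt;/code&gt; columns, along the lines of this hypothetical two-row sample:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;question,answer
"What backup options are available for the CavernDB cluster?","Nightly snapshots plus on-demand exports."
"Who maintains the Eastern Caverns knowledge base?","The resident archivist team."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;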


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;h3&gt;
  
  
  Configuration &amp;amp; Imports
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Configuration &amp;amp; Imports
&lt;/span&gt;&lt;span class="n"&gt;USE_CSV&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;                 &lt;span class="c1"&gt;# False → load a ShareGPT dataset instead
&lt;/span&gt;&lt;span class="n"&gt;CSV_PATH&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_data.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;      &lt;span class="c1"&gt;# CSV must have 'question' &amp;amp; 'answer' columns
&lt;/span&gt;&lt;span class="n"&gt;SHAREGPT_DS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mlabonne/FineTome-100k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# HF ShareGPT-style dataset
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;load_dataset&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;unsloth.chat_templates&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_chat_template&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;standardize_sharegpt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujopftdq3n3iw2hhwik5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujopftdq3n3iw2hhwik5.png" alt="Uploaded CSV file screenshot showing file path"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Loading &amp;amp; Wrapping Data
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;USE_CSV&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CSV_PATH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pandas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;to_sharegpt_format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;conversations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a Victorian-era assistant…&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_sharegpt_format&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;remove_columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SHAREGPT_DS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Standardizing &amp;amp; Applying the Chat Template
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;standardize_sharegpt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;CHAT_TEMPLATE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama-3.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_chat_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat_template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CHAT_TEMPLATE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_prompts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;examples&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply_chat_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;convo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tokenize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;add_generation_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;convo&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;examples&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;conversations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;nds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;format_prompts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batched&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;===== BEFORE =====&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;conversations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;===== AFTER =====&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffejnvscexff13xl4x456.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffejnvscexff13xl4x456.png" alt="Console output showing before and after dataset conversation formatting"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Supervised Fine-Tuning with SFTTrainer
&lt;/h2&gt;

&lt;p&gt;Too many things here 😮‍💨 — let’s explore them one by one.&lt;/p&gt;

&lt;h3&gt;
  
  
  🛠️ SFTTrainer
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Purpose:&lt;/strong&gt; Trainer for Supervised Fine-Tuning (SFT) of LLMs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why:&lt;/strong&gt; Streamlines fine-tuning with built-in utilities.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Components
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;model, tokenizer, train_dataset&lt;/strong&gt; — The model, tokenizer, and dataset (&lt;code&gt;nds&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dataset_text_field="text"&lt;/strong&gt; — Uses the &lt;code&gt;"text"&lt;/code&gt; field for input.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DataCollatorForSeq2Seq&lt;/strong&gt; — Pads and batches seq2seq data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TrainingArguments&lt;/strong&gt; — Hyperparameters (batch size, learning rate, epochs, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;per_device_train_batch_size=2&lt;/strong&gt; — Examples per device.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gradient_accumulation_steps=4&lt;/strong&gt; — Accumulates gradients over 4 steps to simulate a larger batch (effective batch size 2 × 4 = 8).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;warmup_steps=5&lt;/strong&gt; — Smooth learning rate start.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;max_steps=100&lt;/strong&gt; — Total training steps (takes precedence over &lt;code&gt;num_train_epochs&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;optim="adamw_8bit"&lt;/strong&gt; — 8-bit AdamW optimizer for memory savings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;output_dir="outputs"&lt;/strong&gt; — Where checkpoints go.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;trl&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SFTTrainer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TrainingArguments&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DataCollatorForSeq2Seq&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;unsloth&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;is_bfloat16_supported&lt;/span&gt;

&lt;span class="n"&gt;trainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SFTTrainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;nds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dataset_text_field&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_seq_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_seq_length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;data_collator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;DataCollatorForSeq2Seq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;TrainingArguments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;per_device_train_batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;gradient_accumulation_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;warmup_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;num_train_epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;learning_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2e-4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;fp16&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;is_bfloat16_supported&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;bf16&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;is_bfloat16_supported&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;logging_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;adamw_8bit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;weight_decay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;outputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;report_to&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  6. Kicking Off the Training
&lt;/h2&gt;

&lt;p&gt;Use Unsloth’s &lt;code&gt;train_on_responses_only&lt;/code&gt; to compute loss only on the assistant’s output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;unsloth.chat_templates&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_on_responses_only&lt;/span&gt;

&lt;span class="n"&gt;trainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_on_responses_only&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instruction_part&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;|start_header_id|&amp;gt;user&amp;lt;|end_header_id|&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;response_part&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;|start_header_id|&amp;gt;assistant&amp;lt;|end_header_id|&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
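
&lt;p&gt;To confirm the masking worked, you can inspect one tokenized example: prompt positions get a label of -100 (ignored by the loss), so decoding only the unmasked positions should yield just the assistant’s reply. A rough check, assuming the trainer has already tokenized the dataset into &lt;code&gt;input_ids&lt;/code&gt; and &lt;code&gt;labels&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;row = trainer.train_dataset[0]
kept = [tok_id for tok_id, lab in zip(row["input_ids"], row["labels"]) if lab != -100]
print(tokenizer.decode(kept))   # should print only the assistant's answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;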



&lt;p&gt;You’ll see the loss drop steadily — proof your 4-bit Llama 3.2 is learning efficiently.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo473hic9egzj93w1rvtd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo473hic9egzj93w1rvtd.png" alt="Training loss curve showing decrease over steps"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Inference &amp;amp; Saving Your Model
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Fast Inference Mode&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FastLanguageModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;for_inference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply_chat_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;Your Question Here&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;tokenize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;add_generation_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;batch_decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyc5wj67wsrda0ugtufnc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyc5wj67wsrda0ugtufnc.png" alt="Example generated response in notebook output"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Save &amp;amp; Reload&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Save
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/content/drive/MyDrive/my_llama3_model_eastern_caverns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/content/drive/MyDrive/my_llama3_model_eastern_caverns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Reload in 4-bit mode
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;unsloth&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastLanguageModel&lt;/span&gt;

&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/content/drive/MyDrive/my_llama3_model_eastern_caverns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FastLanguageModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/content/drive/MyDrive/my_llama3_model_eastern_caverns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;load_in_4bit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_seq_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Quick test
&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What backup options are available for the CavernDB cluster?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;skip_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbnzbtfqst5juw84dnxzj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbnzbtfqst5juw84dnxzj.png" alt="Output showing saved model reload test"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3072%2F0%2A_Z_jZ8xDebcmwHg_" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3072%2F0%2A_Z_jZ8xDebcmwHg_" alt="The Pharaoh’s Workshop of Wisdom with UnSloth Guide — ornate workshop scene"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fine-tuning a large language model is all about balancing precision, speed, and the right data.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Precision vs. Quantization&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Full precision (FP32) → ~4.3 billion representable values per weight (2³²)&lt;/li&gt;
&lt;li&gt;4-bit quantization → 16 levels (tiny rounding error for big memory savings)&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;strong&gt;Why 4-Bit Helps&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;A 7 B-param model needs ~28 GB in FP32 or ~14 GB in FP16; 4-bit cuts that to roughly 3.5–4 GB&lt;/li&gt;
&lt;li&gt;Avoids out-of-memory on free GPUs (Colab, Kaggle)&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="3"&gt;
&lt;li&gt;&lt;strong&gt;Unsloth’s Speed Boost&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Up to 2× faster fine-tuning, ~70% VRAM reduction&lt;/li&gt;
&lt;li&gt;Memory-efficient kernels without accuracy loss&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="4"&gt;
&lt;li&gt;&lt;strong&gt;Picking the Right Dataset&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Too small → overfitting; too large or noisy → underfitting&lt;/li&gt;
&lt;li&gt;Aim for focused, high-quality examples in a consistent, well-ordered format&lt;/li&gt;
&lt;/ul&gt;
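
&lt;p&gt;To make the memory arithmetic above concrete, here is a minimal back-of-the-envelope sketch in Python (illustrative only; real loaders add overhead for activations, optimizer state, and the KV cache):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def approx_weight_memory_gb(num_params, bits_per_weight):
    """Weight-only footprint; ignores activations, KV cache, and loader overhead."""
    return num_params * bits_per_weight / 8 / 1e9

for label, bits in [("FP32", 32), ("FP16", 16), ("4-bit", 4)]:
    print(f"7B params in {label}: ~{approx_weight_memory_gb(7e9, bits):.1f} GB")
# 7B params in FP32: ~28.0 GB
# 7B params in FP16: ~14.0 GB
# 7B params in 4-bit: ~3.5 GB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;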

&lt;p&gt;That’s your bird’s-eye view: squeezing precision, saving resources, turbo-charging training with Unsloth, and choosing data wisely. Feedback and questions are always welcome!&lt;/p&gt;

&lt;p&gt;One last thing — if you’ve made it this far, I’ve dropped the Colab notebook link in the comments below. Feel free to dive in and give it a spin 😉 !&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading &amp;amp; Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unsloth GitHub&lt;/strong&gt; — &lt;a href="https://github.com/unslothai/unsloth" rel="noopener noreferrer"&gt;https://github.com/unslothai/unsloth&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hugging Face Datasets&lt;/strong&gt; — &lt;a href="https://huggingface.co/docs/datasets" rel="noopener noreferrer"&gt;https://huggingface.co/docs/datasets&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PEFT Paper (LoRA)&lt;/strong&gt; — &lt;a href="https://arxiv.org/abs/2106.09685" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2106.09685&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google Colab&lt;/strong&gt; — &lt;a href="https://colab.research.google.com/" rel="noopener noreferrer"&gt;https://colab.research.google.com/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;#LLM #fine-tuning #Unsloth #tutorial #PEFT #LoRA #4-bit #quantization #Colab #T4GPU #SFTTrainer #inference #optimization&lt;/p&gt;

</description>
      <category>finetuning</category>
      <category>llm</category>
      <category>ai</category>
      <category>genai</category>
    </item>
    <item>
      <title>The Best Way to Create AWS Lambda λ Layers for Python 🐍 (Fast &amp; Easy) : A Complete Guide</title>
      <dc:creator>Rishab Dugar</dc:creator>
      <pubDate>Wed, 01 Jan 2025 17:54:18 +0000</pubDate>
      <link>https://dev.to/rishabdugar/creating-aws-lambda-layers-for-python-runtime-a-complete-guide-3gi0</link>
      <guid>https://dev.to/rishabdugar/creating-aws-lambda-layers-for-python-runtime-a-complete-guide-3gi0</guid>
      <description>&lt;p&gt;Managing dependencies for AWS Lambda functions can be challenging, especially when using Python packages like NumPy, pandas, or requests that contain compiled binaries. This guide provides a comprehensive approach to creating AWS Lambda layers, ensuring compatibility with Lambda’s Linux-based runtime, even when developing on non-Linux systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Are AWS Lambda Layers?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2ABIz4xsnRxOdD4WWU.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2ABIz4xsnRxOdD4WWU.png" alt="Credits : [AWS](https://aws.amazon.com/blogs/compute/working-with-aws-lambda-and-lambda-layers-in-aws-sam/)" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS Lambda layers are .zip archives that let you share libraries, custom runtimes, or other dependencies across functions. They simplify updates, promote modularity, and reduce deployment sizes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Challenges
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqk77g0ha4ts0o68qnqu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqk77g0ha4ts0o68qnqu.png" alt="Credits : stack**overflow**" width="800" height="207"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Python packages with compiled code may not work out-of-the-box on AWS Lambda, especially if you’re developing on Windows or macOS. Errors like &lt;em&gt;“Unable to import module”&lt;/em&gt; often arise because pip installs packages for your local architecture rather than Lambda’s Linux environment.&lt;/p&gt;

&lt;p&gt;To address this, you need to build your deployment package or layer in a way that is compatible with AWS Lambda’s Linux runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step-by-Step Guide to Creating AWS Lambda Layers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Prepare Your Environment
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Ensure that the Python version matching your Lambda function’s runtime (or the latest release) is installed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Check your pip version and upgrade it if it’s older than 19.3.0:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="n"&gt;py&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;upgrade&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Create the Directory Structure
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Create a working directory for your Lambda layer:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="n"&gt;mkdir&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="n"&gt;layer&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Open the directory in your IDE (e.g., Visual Studio Code).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Install Dependencies
&lt;/h3&gt;

&lt;p&gt;To ensure compatibility, use pip’s --platform parameter to target AWS Lambda’s Linux environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="err"&gt;`&lt;/span&gt;  
        &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;platform&lt;/span&gt; &lt;span class="n"&gt;manylinux2014_x86_64&lt;/span&gt; &lt;span class="err"&gt;`&lt;/span&gt;  
        &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;layer&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python&lt;/span&gt; &lt;span class="err"&gt;`&lt;/span&gt;  
        &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;implementation&lt;/span&gt; &lt;span class="n"&gt;cp&lt;/span&gt; &lt;span class="err"&gt;`&lt;/span&gt;  
        &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt; &lt;span class="mf"&gt;3.12&lt;/span&gt; &lt;span class="err"&gt;`&lt;/span&gt;  
        &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;only&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;binary&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nb"&gt;all&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;upgrade&lt;/span&gt; &lt;span class="err"&gt;`&lt;/span&gt;  
        &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This installs the dependencies directly into the layer/python directory, ready for packaging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Ensure your installed packages don’t exceed Lambda’s 50 MB limit for direct uploads. You’ll need to upload the .zip file to an S3 bucket for larger packages.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Package the Layer
&lt;/h3&gt;

&lt;p&gt;Navigate to the layer directory and create a .zip archive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="n"&gt;cd&lt;/span&gt; &lt;span class="n"&gt;layer&lt;/span&gt;
    &lt;span class="nb"&gt;zip&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python_layer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;zip&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
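
&lt;p&gt;If the &lt;code&gt;zip&lt;/code&gt; utility isn’t available (for example, on a stock Windows setup), the same archive can be produced with Python’s standard library; a small equivalent sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import shutil

# Zips the *contents* of ./layer (i.e., the python/ folder) into python_layer.zip,
# which is the top-level structure Lambda expects for a layer.
shutil.make_archive("python_layer", "zip", root_dir="layer")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;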



&lt;h3&gt;
  
  
  5. Deploy the Layer to AWS Lambda
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmh73dsj1aglauq3hgubx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmh73dsj1aglauq3hgubx.png" alt="Credits : stack**overflow**" width="800" height="841"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Log in to the AWS Management Console.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Go to Lambda &amp;gt; Layers and click &lt;em&gt;Create Layer&lt;/em&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Upload the python_layer.zip file, select the compatible runtime (e.g., Python 3.12), and save the layer.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
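
&lt;p&gt;If you prefer the SDK over the console, the same upload-and-publish flow can be scripted with boto3. A minimal sketch; the bucket, key, and layer names below are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import boto3

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

# Placeholder bucket/key/layer names -- replace with your own.
s3.upload_file("python_layer.zip", "my-layer-bucket", "layers/python_layer.zip")

response = lambda_client.publish_layer_version(
    LayerName="my-python-deps",
    Content={"S3Bucket": "my-layer-bucket", "S3Key": "layers/python_layer.zip"},
    CompatibleRuntimes=["python3.12"],
)
print(response["LayerVersionArn"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;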

&lt;h3&gt;
  
  
  6. Attach the Layer to Your Lambda Function
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftw5ssrqre6cz8x7gwnvr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftw5ssrqre6cz8x7gwnvr.png" alt="Credits : stack**overflow**" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Open your Lambda function in the AWS Console.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Under the &lt;em&gt;Layers&lt;/em&gt; section, click &lt;em&gt;Add a layer&lt;/em&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select your custom layer, specify the version, and attach it.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Test Your Lambda Function
&lt;/h2&gt;

&lt;p&gt;Here’s an example Lambda function to verify the layer works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;statusCode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pandas_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__version__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;numpy_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__version__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deploy the function, add the layer, and invoke it to confirm the dependencies are working correctly.&lt;/p&gt;
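
&lt;p&gt;You can also trigger the test from your local machine; a quick sketch with boto3, where the function name is a placeholder:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import boto3

client = boto3.client("lambda")

# "my-layer-test" is a hypothetical function name -- use your own.
result = client.invoke(FunctionName="my-layer-test", Payload=json.dumps({}))
print(json.loads(result["Payload"].read()))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;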

&lt;h2&gt;
  
  
  Some Important Tips
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cross-Platform Compatibility:&lt;/strong&gt; This process works across Windows, macOS, and Linux.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Avoid Empty Directories:&lt;/strong&gt; Before zipping, ensure no unnecessary or empty directories remain in the layer folder.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;File Size Check:&lt;/strong&gt; Always check your .zip file size before uploading to Lambda (use S3 for files larger than 50 MB).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Local Testing:&lt;/strong&gt; Use Docker to emulate the Lambda environment.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2560%2F0%2A3BJlTfqdMvcWTPBD.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2560%2F0%2A3BJlTfqdMvcWTPBD.jpg" alt="Credits : [Youtube - b3nsh4](https://www.youtube.com/watch?v=03PggjOllAM)" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By following these steps, you can efficiently create and deploy AWS Lambda layers for Python runtimes. This approach ensures compatibility with AWS Lambda’s Linux-based environment while promoting reusability and reducing deployment sizes.&lt;/p&gt;

&lt;p&gt;If you’re working with larger packages or need alternative methods (e.g., Docker and Amazon ECR), check out additional resources for advanced deployment techniques.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wltr3j3m0tngvfapdcg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wltr3j3m0tngvfapdcg.png" alt="Lambda’s Lair: Python in the Cloud Forest | Generated by DALL.E-3" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/python-package.html" rel="noopener noreferrer"&gt;&lt;strong&gt;Working with .zip file archives for Python Lambda functions&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/python-layers.html" rel="noopener noreferrer"&gt;&lt;strong&gt;Working with layers for Python Lambda functions&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/chapter-layers.html" rel="noopener noreferrer"&gt;&lt;strong&gt;Managing Lambda dependencies with layers&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://repost.aws/questions/QUoXSNWrkdSf61jVqG7IwBxg/lambda-package-exceeds-60mb-solutions-for-large-dependencies" rel="noopener noreferrer"&gt;&lt;strong&gt;Lambda Package Exceeds 60MB: Solutions for Large Dependencies?&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>PDF Extraction: Retrieving Text and Tables together using Python🐍</title>
      <dc:creator>Rishab Dugar</dc:creator>
      <pubDate>Sun, 22 Sep 2024 13:42:48 +0000</pubDate>
      <link>https://dev.to/rishabdugar/pdf-extraction-retrieving-text-and-tables-together-using-python-14c2</link>
      <guid>https://dev.to/rishabdugar/pdf-extraction-retrieving-text-and-tables-together-using-python-14c2</guid>
      <description>&lt;p&gt;Extracting both text and tables can be challenging when working with PDF files due to their complex structure. However, the “pdfplumber” library offers a powerful solution. This article explores an effective method for combining text and table extraction from PDFs using &lt;code&gt;pdfplumber&lt;/code&gt;. Special thanks to &lt;a href="https://github.com/cmdlineluser" rel="noopener noreferrer"&gt;Karl Genockey&lt;/a&gt; a.k.a. cmdlineuser and other contributors for their brilliant approach discussed &lt;a href="https://github.com/jsvine/pdfplumber/discussions/1026" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Approach
&lt;/h2&gt;

&lt;p&gt;The method involves extracting table objects and text lines separately and then combining them based on their positional values. This ensures that the extracted data maintains the correct order and structure as it appears in the PDF. Let’s break down the code and logic step-by-step.&lt;/p&gt;

&lt;p&gt;As an example, we will use the sample PDF below, containing tables and text across multiple pages.&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://drive.google.com/file/d/1kcF75Do6UjkiYAJTlXcjNBPd-Ex0yAnq/preview?source=post_page" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdrive.google.com%2Fdrive-viewer%2FAKGpihYsmZ09livx7q_4PllgqdojipvwqAX8oDbcqZJVi-gi84hpmIcdXSERjE6aGyNwYIG7AwLeVRjfmdFmXS-ZYiyNmmJ3kk-_Kko%3Ds1600-rw-v1" height="400" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://drive.google.com/file/d/1kcF75Do6UjkiYAJTlXcjNBPd-Ex0yAnq/preview?source=post_page" rel="noopener noreferrer" class="c-link"&gt;
          sample_pdf_with_text_and_table.pdf - Google Drive
        &lt;/a&gt;
      &lt;/h2&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fssl.gstatic.com%2Fimages%2Fbranding%2Fproduct%2F1x%2Fdrive_2020q4_32dp.png" width="32" height="32"&gt;
        drive.google.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A1400%2Fformat%3Awebp%2F1%2AoUqFEfNNQDPVuPZ-UCBxcg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A1400%2Fformat%3Awebp%2F1%2AoUqFEfNNQDPVuPZ-UCBxcg.png" alt="Preview of a page from the sample pdf containing text and tables" width="800" height="655"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before running the code, we should ensure that the necessary libraries are installed. Besides &lt;code&gt;pdfplumber&lt;/code&gt; and &lt;code&gt;pandas&lt;/code&gt;, we also need the &lt;code&gt;tabulate&lt;/code&gt; library, which &lt;code&gt;pandas&lt;/code&gt; uses to convert DataFrame objects to Markdown format, a crucial step in our table extraction process. This conversion helps maintain the structure and readability of table data extracted from the PDF.&lt;/p&gt;
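
&lt;p&gt;As a quick illustration of that conversion (a toy DataFrame, not data from the sample PDF):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

# to_markdown() requires the tabulate package under the hood.
df = pd.DataFrame({"Name": ["Ada", "Alan"], "Age": [36, 41]})
print(df.to_markdown(index=False))
# prints roughly:
# | Name   |   Age |
# |:-------|------:|
# | Ada    |    36 |
# | Alan   |    41 |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;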

&lt;h2&gt;
  
  
  Installing Required Libraries
&lt;/h2&gt;

&lt;p&gt;You can install these libraries using &lt;code&gt;pip&lt;/code&gt;. Run the following command in your terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;pdfplumber&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="n"&gt;tabulate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step-by-Step Explanation
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Import Libraries&lt;/strong&gt;: First things first, we start by importing all necessary libraries.&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;pdfplumber&lt;/code&gt; is used for extracting text and tables from PDFs.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;pandas&lt;/code&gt; is used for handling and manipulating data.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;extract_text&lt;/code&gt;, &lt;code&gt;get_bbox_overlap&lt;/code&gt;, and &lt;code&gt;obj_to_bbox&lt;/code&gt; are utility functions from &lt;code&gt;pdfplumber&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;tabulate&lt;/code&gt; helps in converting data into Markdown format.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pdfplumber&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pdfplumber.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;extract_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_bbox_overlap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;obj_to_bbox&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tabulate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Function Definition and PDF Opening&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;  The function &lt;code&gt;process_pdf&lt;/code&gt; takes &lt;code&gt;pdf_path&lt;/code&gt; as an argument, which is the path to the PDF file.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;pdfplumber.open(pdf_path)&lt;/code&gt; opens the PDF file.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;all_text&lt;/code&gt; is initialized as an empty list to store the extracted text from all pages.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
      &lt;span class="n"&gt;pdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pdfplumber&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;all_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Iterate Over Pages&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;for page in pdf.pages&lt;/code&gt; — The for loop iterates over each page in the PDF.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;filtered_page&lt;/code&gt; — is initially set to the current &lt;code&gt;page&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;chars&lt;/code&gt; — captures all characters on the &lt;code&gt;filtered_page&lt;/code&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;      &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pdf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;filtered_page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;
        &lt;span class="n"&gt;chars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;filtered_page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chars&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Table Detection and Filtering&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;for table in page.find_tables()&lt;/code&gt; — The for loop iterates over each table found on the page.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;first_table_char&lt;/code&gt; — stores the first character of the cropped table area.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;filtered_page&lt;/code&gt; — is updated by filtering out characters that overlap with the table's bounding box using &lt;code&gt;get_bbox_overlap&lt;/code&gt; and &lt;code&gt;obj_to_bbox&lt;/code&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_tables&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;first_table_char&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;crop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bbox&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;chars&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;filtered_page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;filtered_page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
                &lt;span class="nf"&gt;get_bbox_overlap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;obj_to_bbox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bbox&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;chars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;filtered_page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chars&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Extract and Convert Table to Markdown&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;table.extract()&lt;/code&gt; extracts the table content.&lt;/li&gt;
&lt;li&gt;  A DataFrame &lt;code&gt;df&lt;/code&gt; is created from the extracted table data.&lt;/li&gt;
&lt;li&gt;  The first row is set as the header using &lt;code&gt;df.columns = df.iloc[0]&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;  The rest of the DataFrame is converted to Markdown format and stored in &lt;code&gt;markdown&lt;/code&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;            &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;markdown&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to_markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Append Markdown to Characters&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;  The &lt;code&gt;first_table_char&lt;/code&gt; is updated with the &lt;code&gt;markdown&lt;/code&gt; content and appended to &lt;code&gt;chars&lt;/code&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;chars&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first_table_char&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;markdown&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Extract Page Text&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;extract_text(chars, layout=True)&lt;/code&gt; extracts the text from the filtered characters with layout preservation.&lt;/li&gt;
&lt;li&gt;  The extracted text &lt;code&gt;page_text&lt;/code&gt; is appended to &lt;code&gt;all_text&lt;/code&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;        &lt;span class="n"&gt;page_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chars&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;layout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;all_text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Close PDF and Return Text&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;  The PDF file is closed using &lt;code&gt;pdf.close()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;  The extracted text from all pages is joined into a single string with newline characters and returned.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="n"&gt;pdf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
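
&lt;p&gt;As a design note, the manual &lt;code&gt;open&lt;/code&gt;/&lt;code&gt;close&lt;/code&gt; pair can also be written with a context manager, so the file is closed even if an exception occurs mid-extraction; a minimal sketch of that variant:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pdfplumber
from pdfplumber.utils import extract_text

def process_pdf(pdf_path):
    all_text = []
    with pdfplumber.open(pdf_path) as pdf:  # closed automatically, even on errors
        for page in pdf.pages:
            # ... same per-page table filtering and markdown logic as above ...
            all_text.append(extract_text(page.chars, layout=True))
    return "\n".join(all_text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;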



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Execute Function and Print Result&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;  The path to the PDF file is defined in &lt;code&gt;pdf_path&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;process_pdf(pdf_path)&lt;/code&gt; is called to process the PDF and extract text.&lt;/li&gt;
&lt;li&gt;  The extracted text is printed.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Path to your PDF file
&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample_pdf.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;extracted_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;process_pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extracted_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Complete Code
&lt;/h2&gt;

&lt;p&gt;Here is the complete script for extracting text and tables as markdown from a PDF:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pdfplumber&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pdfplumber.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;extract_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_bbox_overlap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;obj_to_bbox&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;pdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pdfplumber&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;all_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pdf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;filtered_page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;
        &lt;span class="n"&gt;chars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;filtered_page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chars&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_tables&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;first_table_char&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;crop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bbox&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;chars&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;filtered_page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;filtered_page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
                &lt;span class="nf"&gt;get_bbox_overlap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;obj_to_bbox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bbox&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;chars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;filtered_page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chars&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;markdown&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to_markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;chars&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first_table_char&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;markdown&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="n"&gt;page_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chars&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;layout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;all_text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pdf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Path to your PDF file
&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample_pdf.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;extracted_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;process_pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extracted_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Output
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Hello
World
| First name   | Last name   |   Age | City        |
|:-------------|:------------|------:|:------------|
| Nobita       | Nobi        |    15 | Tokyo       |
| Eli          | Shane       |    23 | Orlando     |
| Rahul        | Jain        |    22 | Los Angeles |
| Lucy         | Carlyle     |    17 | London      |
| Anthony      | Lockwood    |    19 | Leicester   |
Loreum  ipsum
dolor sit amet,
consectetur
adipiscing
Hello
Python
| First name   | Last name   | Address             |
|:-------------|:------------|:--------------------|
| James        | Watson      | 221 B, Baker Street |
| Mycroft      | Holmes      | Diogenes Club       |
| Irene        | Adler       | 21 New Jersey       |
| Lucy         | Carlyle     | 33 Claremont Square |
| Anthony      | Lockwood    | 35 Portland Row     |
Neque  porro
quisquam  est qui
            dolorem
      ipsum     quia
      dolor sit amet,
consectetur, adipisci
velit...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This approach provides a systematic way to extract and combine text and tables from PDFs using &lt;a href="https://github.com/jsvine/pdfplumber" rel="noopener noreferrer"&gt;“pdfplumber”&lt;/a&gt;. By leveraging table and text line positional values, we can maintain the integrity of the original document’s layout. Credits to &lt;a href="https://github.com/cmdlineluser" rel="noopener noreferrer"&gt;cmdlineluser&lt;/a&gt; and &lt;a href="https://github.com/jsvine" rel="noopener noreferrer"&gt;jsvine&lt;/a&gt; for their insightful discussion and innovative solution to the problem!&lt;/p&gt;

&lt;p&gt;That’s all for now! Hope this tutorial was helpful. Feel free to explore and adapt this method to fit your specific needs.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>python</category>
      <category>computerscience</category>
      <category>pdf</category>
    </item>
    <item>
      <title>Managing Chat History in AWS Bedrock Models: A Deep Dive into Llama 3 🦙 and Anthropic Claude 🤖</title>
      <dc:creator>Rishab Dugar</dc:creator>
      <pubDate>Fri, 20 Sep 2024 10:41:23 +0000</pubDate>
      <link>https://dev.to/rishabdugar/managing-chat-history-in-aws-bedrock-models-a-deep-dive-into-llama-3-and-anthropic-claude-1b1k</link>
      <guid>https://dev.to/rishabdugar/managing-chat-history-in-aws-bedrock-models-a-deep-dive-into-llama-3-and-anthropic-claude-1b1k</guid>
      <description>&lt;p&gt;When developing conversational AI systems, handling multi-turn conversations effectively is crucial for maintaining a coherent dialogue and providing contextually relevant responses. &lt;br&gt;
Amazon Bedrock is a fully managed service that makes foundation models accessible via an API. Creating chatbots that handle multi-turn conversations requires maintaining and utilizing context from previous interactions. This ensures relevant and coherent responses, enhancing user satisfaction. &lt;br&gt;
Llama 3, developed by Meta, offers robust capabilities for managing such interactions.&lt;br&gt;
This section delves into the specifics of structuring prompts for Llama 3 and provides a Python example to invoke this model using AWS Bedrock.&lt;/p&gt;
&lt;h2&gt;
  
  
  Understanding Prompt Tokens in Llama 3
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb92jyibcqc8fn3x7egtg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb92jyibcqc8fn3x7egtg.png" alt="Chronicles of Time: Generated by DALL.E-3" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Llama 3 utilizes specific tokens to manage conversation flow, ensuring clarity and context retention across multiple turns. Here’s an overview of key tokens:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;&amp;lt;|start_header_id|&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;|end_header_id|&amp;gt;&lt;/code&gt;&lt;/strong&gt;: These tokens define the role of each message segment within the conversation (e.g., system, user, assistant). Encapsulating messages with these tags helps the model understand who is speaking and adjust its responses accordingly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;&amp;lt;|eot_id|&amp;gt;&lt;/code&gt;&lt;/strong&gt;: The "End of Turn" token signifies that the model has completed its response for the current turn. This is crucial in multi-turn conversations to delineate where one turn ends and another begins.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;&amp;lt;|eom_id|&amp;gt;&lt;/code&gt;&lt;/strong&gt;: "End of Message" indicates a potential continuation point within a conversation where a tool call might be needed. This token is particularly useful when integrating external tools or APIs that require back-and-forth interaction within a single turn.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These tokens play pivotal roles in structuring inputs and outputs for Llama 3, enabling it to handle complex conversational scenarios effectively.&lt;/p&gt;
&lt;h3&gt;
  
  
  Example: Structuring Prompts for Multi-Turn Conversation
&lt;/h3&gt;

&lt;p&gt;Consider a scenario where you are creating an AI assistant capable of conducting an interactive session about travel recommendations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;|begin_of_text|&amp;gt;&amp;lt;|start_header_id|&amp;gt;system&amp;lt;|end_header_id|&amp;gt;
You are an AI trained to provide travel advice.&amp;lt;|eot_id|
&amp;lt;|start_header_id|&amp;gt;user&amp;lt;|end_header_id|&amp;gt;
What are some top destinations in Europe?&amp;lt;|eot_id|
&amp;lt;|start_header_id|&amp;gt;assistant&amp;lt;|end_header_id|
Top destinations include Paris, Rome, and Barcelona.&amp;lt;eot_id|
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each participant's role is clearly marked using &lt;code&gt;&amp;lt;|start_header_id|&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;|end_header_id|&amp;gt;&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;&amp;lt;|eot_id|&amp;gt;&lt;/code&gt; token after each message ensures that each turn is distinctly recognized by the model.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Python Example: Invoking Llama 3 via AWS Bedrock
&lt;/h3&gt;

&lt;p&gt;Below we demonstrate invoking Llama 3 using AWS Bedrock API in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;botocore.exceptions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ClientError&lt;/span&gt;

&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setLevel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create a Bedrock Runtime client in the AWS Region of your choice.
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-runtime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Set the model ID, e.g., Llama 3 8b Instruct.
&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta.llama3-8b-instruct-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AWSLambdaLLAMA3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;maxTokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;topP&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;construct_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Create prompt for LLAMA3
        Args:
            question: str; query from user
            chat_history: list; list of formatted conversation history
        Returns:
            Prompt text to be used in the model
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;header&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;|start_header_id|&amp;gt;system&amp;lt;|end_header_id|&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Your system prompt here&amp;lt;|eot_id|&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;chat_history_formatted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format_chat_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;|start_header_id|&amp;gt;user&amp;lt;|end_header_id|&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;|eot_id|&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;|start_header_id|&amp;gt;assistant&amp;lt;|end_header_id|&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;final_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;chat_history_formatted&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Final Prompt: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;final_prompt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;final_prompt&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_chat_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Format chat history for LLAMA3
        Args:
            chat_history: list; list of conversation history
        Returns:
            Formatted chat history string
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;formatted_history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;user_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;assistant_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;formatted_history&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;|start_header_id|&amp;gt;user&amp;lt;|end_header_id|&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;|eot_id|&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;|start_header_id|&amp;gt;assistant&amp;lt;|end_header_id|&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;assistant_response&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;|eot_id|&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;formatted_history&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generate response using LLAMA3 model via AWS Bedrock
        Args:
            question: str; question from the user
            chat_history: list; list of formatted conversation history
        Returns:
            The generated AI response text
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;construct_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;native_request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_gen_len&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;maxTokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;top_p&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;topP&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;native_request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Request Body: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;response_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Response Body: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response_body&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response_body&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;generation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ClientError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ERROR: Can&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t invoke &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;. Reason: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="c1"&gt;# Example usage
&lt;/span&gt;&lt;span class="n"&gt;lambda_llama3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AWSLambdaLLAMA3&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;chat_history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello, how are you?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;m good, thank you! How can I assist you today?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Can you tell me a joke?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sure! Why don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t scientists trust atoms? Because they make up everything!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lambda_llama3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s the weather like today?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Final Prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;|start_header_id|&amp;gt;system&amp;lt;|end_header_id|&amp;gt;
Your system prompt here&amp;lt;|eot_id|&amp;gt;
&amp;lt;|start_header_id|&amp;gt;user&amp;lt;|end_header_id|&amp;gt;
Hello, how are you?&amp;lt;|eot_id|&amp;gt;
&amp;lt;|start_header_id|&amp;gt;assistant&amp;lt;|end_header_id|&amp;gt;
I'm good, thank you! How can I assist you today?&amp;lt;|eot_id|&amp;gt;
&amp;lt;|start_header_id|&amp;gt;user&amp;lt;|end_header_id|&amp;gt;
Can you tell me a joke?&amp;lt;|eot_id|&amp;gt;
&amp;lt;|start_header_id|&amp;gt;assistant&amp;lt;|end_header_id|&amp;gt;
Sure! Why don't scientists trust atoms? Because they make up everything!&amp;lt;|eot_id|&amp;gt;
&amp;lt;|start_header_id|&amp;gt;user&amp;lt;|end_header_id|&amp;gt;
What's the weather like today?&amp;lt;|eot_id|&amp;gt;
&amp;lt;|start_header_id|&amp;gt;assistant&amp;lt;|end_header_id|&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Challenges with Llama 3
&lt;/h4&gt;

&lt;p&gt;While this format is straightforward, managing longer conversations becomes tricky because of &lt;strong&gt;token limits&lt;/strong&gt;: every turn you replay consumes part of the model's context window. As a conversation grows, earlier turns may need to be pruned or summarized to stay within those limits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best Practices:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Limit Number of Turns:&lt;/strong&gt; Pass only the most recent interactions, since older context often becomes irrelevant.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use a Sliding Window:&lt;/strong&gt; Retain only the last few turns that fit within the token limit (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize System Prompts:&lt;/strong&gt; Keep the system prompt concise so it sets clear context without eating into the token budget.&lt;/li&gt;
&lt;/ol&gt;
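
&lt;p&gt;Here is a minimal sliding-window sketch. It assumes the &lt;code&gt;{"user": ..., "assistant": ...}&lt;/code&gt; history entries used in the Llama 3 examples above and approximates tokens as four characters each; a production implementation would measure length with the model's actual tokenizer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def sliding_window(chat_history, max_tokens=2048, chars_per_token=4):
    """Keep only the most recent turns that fit an approximate token budget."""
    budget = max_tokens * chars_per_token  # work in characters for simplicity
    kept, used = [], 0
    for entry in reversed(chat_history):  # walk from the newest turn backwards
        turn_len = len(entry["user"]) + len(entry["assistant"])
        if used + turn_len &amp;gt; budget:
            break  # adding this turn would overflow the budget
        kept.append(entry)
        used += turn_len
    return list(reversed(kept))  # restore chronological order

# Example: trim the history before building the prompt
trimmed_history = sliding_window(chat_history, max_tokens=512)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
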

&lt;h2&gt;
  
  
  Mastering Multi-Turn Conversations with Anthropic Claude on AWS Bedrock
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F05h5646oixhwx19hzwhc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F05h5646oixhwx19hzwhc.png" alt="Claude's Vision in Victorian Innovation : Generated by DALL.E-3" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Anthropic Claude is one of the advanced language models available on Bedrock, designed to understand and generate human-like text. By integrating Claude through Bedrock, developers can harness powerful NLP capabilities without managing the underlying infrastructure, and the model handles multi-turn interactions gracefully. Let's explore how to pass chat history to Claude so our applications can engage in meaningful, context-aware dialogues.&lt;/p&gt;

&lt;h3&gt;
  
  
  APIs for Claude on Amazon Bedrock
&lt;/h3&gt;

&lt;p&gt;Anthropic Claude on Amazon Bedrock offers two distinct APIs tailored to different versions of the model:&lt;/p&gt;

&lt;h4&gt;
  
  
  Text Completion API (Claude v1 and v2.x)
&lt;/h4&gt;

&lt;p&gt;The Text Completion API is used by earlier versions of Claude (v1 and v2.x). It lets developers generate text completions from a single prompt string. While it works well for single-turn interactions, multi-turn conversations require you to manage the context yourself.&lt;/p&gt;

&lt;p&gt;To pass chat history to &lt;strong&gt;Claude's text completion API&lt;/strong&gt; (versions 1 and 2.x), you structure the prompt manually, appending user inputs and AI responses in a dialogue-like format. Each turn is a string in which "Human" marks the user input and "Assistant" marks Claude's response, and turns are separated by blank lines. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Chat history formatted for text completion API
&lt;/span&gt;&lt;span class="n"&gt;chat_history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Human: What is the capital of France?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Assistant: The capital of France is Paris.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Human: Tell me more about Paris.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Assistant:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Now use the prompt to generate the next response
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This format allows Claude to "remember" previous interactions: the chat history is fed back into the prompt, so the conversational context is preserved. The prompt is then passed to the text completion API, which generates the next assistant response.&lt;/p&gt;
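
&lt;p&gt;Here is a minimal sketch of sending that prompt through Bedrock's runtime client. The model ID ("anthropic.claude-v2") and the inference parameters are illustrative assumptions; substitute the text-completion Claude model enabled in your account.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import boto3
import json

client = boto3.client("bedrock-runtime", region_name="us-east-1")

body = json.dumps({
    "prompt": prompt,               # the Human/Assistant prompt built above
    "max_tokens_to_sample": 256,    # required field for the text completion API
    "temperature": 0.5
})

# "anthropic.claude-v2" is a placeholder model ID
response = client.invoke_model(modelId="anthropic.claude-v2", body=body)
response_body = json.loads(response["body"].read())
print(response_body["completion"])  # completions come back in the "completion" field
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
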

&lt;h4&gt;
  
  
  Messages API (Claude v3)
&lt;/h4&gt;

&lt;p&gt;Introduced with Claude version 3, the Messages API supports chat history natively: you pass the conversation as an ordered list of role-tagged messages, and the model maintains context across turns. This simplifies implementing context-aware dialogues, making it ideal for modern conversational applications.&lt;/p&gt;
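
&lt;p&gt;Concretely, a two-turn exchange takes the following shape (the same structure the formatting helper in the next section produces):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Each message carries a role ("user" or "assistant") and a list of content blocks
messages = [
    {"role": "user", "content": [{"type": "text", "text": "Hello!"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "Hi! How can I help you today?"}]}
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
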

&lt;h4&gt;
  
  
  Implementing Multi-Turn Conversations
&lt;/h4&gt;

&lt;p&gt;To create a multi-turn conversation with Anthropic Claude on AWS Bedrock, follow these key steps:&lt;/p&gt;

&lt;h4&gt;
  
  
  Formatting Chat History
&lt;/h4&gt;

&lt;p&gt;First, format the existing chat history into a structure that Claude can understand by mapping user inputs and AI responses appropriately.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;multi_turn_bedrock_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Convert chat history to the required format for Bedrock Anthropic Claude.

    Parameters:
        chat_history (list): A list of chat history where &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Human&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; corresponds 
                             to &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; and &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AI&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; to &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.

    Returns:
        messages: The formatted request body for Bedrock Anthropic Claude.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="c1"&gt;# Convert chat history into expected format
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}]&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}]&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Multi turn chat history formatted:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;


&lt;span class="n"&gt;chat_history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; 
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;If I start the trip at 8:00 AM and stop for lunch at noon, what time will I reach Berlin?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;If you stop for lunch at noon for 1 hour and then continue driving, you would reach Berlin around 7:30 PM, assuming you maintain a speed of 100 kilometers per hour.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What if I encounter traffic that delays me by 30 minutes?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;If you&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re delayed by 30 minutes due to traffic, you&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ll reach Berlin by 8:00 PM.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Interesting! Now, let&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s switch topics. What’s the largest planet in the Solar System?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The largest planet in the Solar System is Jupiter.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How many moons does it have?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; 
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Jupiter has 95 known moons, with the four largest being the Galilean moons: Io, Europa, Ganymede, and Callisto.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;If I were traveling at the speed of light, how long would it take to reach there from Earth?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; 
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;At the speed of light, it would take approximately 43.3 minutes to reach Jupiter from Earth when they are at their closest approach.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;formatted_messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;multi_turn_bedrock_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sample Output (formatted_messages)&lt;/strong&gt;&lt;br&gt;
This structured output is sent to the Anthropic Claude API for processing, maintaining the context of the conversation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[
    {
        "role": "user",
        "content": [{'type': 'text', 'text': "If I start the trip at 8:00 AM and stop for lunch at noon, what time will I reach Berlin?"}]
    },
    {
        "role": "assistant",
        "content": [{'type': 'text', 'text': "If you stop for lunch at noon for 1 hour and then continue driving, you would reach Berlin around 7:30 PM, assuming you maintain a speed of 100 kilometers per hour."}]
    },
    {
        "role": "user",
        "content": [{'type': 'text', 'text': "What if I encounter traffic that delays me by 30 minutes?"}]
    },
    {
        "role": "assistant",
        "content": [{'type': 'text', 'text': "If you're delayed by 30 minutes due to traffic, you'll reach Berlin by 8:00 PM."}]
    },
    {
        "role": "user",
        "content": [{'type': 'text', 'text': "Interesting! Now, let's switch topics. What’s the largest planet in the Solar System?"}]
    },
    {
        "role": "assistant",
        "content": [{'type': 'text', 'text': "The largest planet in the Solar System is Jupiter."}]
    },
    {
        "role": "user",
        "content": [{'type': 'text', 'text': "How many moons does it have?"}]
    },
    {
        "role": "assistant",
        "content": [{'type': 'text', 'text': "Jupiter has 95 known moons, with the four largest being the Galilean moons: Io, Europa, Ganymede, and Callisto."}]
    },
    {
        "role": "user",
        "content": [{'type': 'text', 'text': "If I were traveling at the speed of light, how long would it take to reach there from Earth?"}]
    },
    {
        "role": "assistant",
        "content": [{'type': 'text', 'text': "At the speed of light, it would take approximately 43.3 minutes to reach Jupiter from Earth when they are at their closest approach."}]
    }
]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Constructing the Request Payload
&lt;/h4&gt;

&lt;p&gt;Once formatted, combine the chat history with the new user prompt to form a complete request payload. The payload also carries metadata: the &lt;code&gt;anthropic_version&lt;/code&gt; string, a system prompt, and inference settings such as &lt;code&gt;max_tokens&lt;/code&gt; and &lt;code&gt;temperature&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;chat_history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; 
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;If I start the trip at 8:00 AM and stop for lunch at noon, what time will I reach Berlin?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;If you stop for lunch at noon for 1 hour and then continue driving, you would reach Berlin around 7:30 PM, assuming you maintain a speed of 100 kilometers per hour.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What if I encounter traffic that delays me by 30 minutes?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;If you&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re delayed by 30 minutes due to traffic, you&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ll reach Berlin by 8:00 PM.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Interesting! Now, let&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s switch topics. What’s the largest planet in the Solar System?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The largest planet in the Solar System is Jupiter.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How many moons does it have?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; 
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Jupiter has 95 known moons, with the four largest being the Galilean moons: Io, Europa, Ganymede, and Callisto.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;If I were traveling at the speed of light, how long would it take to reach there from Earth?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; 
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;At the speed of light, it would take approximately 43.3 minutes to reach Jupiter from Earth when they are at their closest approach.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;formatted_messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;multi_turn_bedrock_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What if I reduce my speed by 50%?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;#followup question related to chat history
&lt;/span&gt;
&lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-2023-05-31&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are an AI assistant that remembers past conversations.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;formatted_messages&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Invoking Anthropic Claude
&lt;/h4&gt;

&lt;p&gt;Invoke Anthropic Claude using AWS Bedrock's runtime client by sending the constructed payload.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;botocore.exceptions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ClientError&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invoke_anthropic_claude&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;

    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-runtime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;model-id&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Replace with your specific model ID
&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;accept&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;contentType&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;ClientError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ERROR: Can&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t invoke &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;. Reason: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;model_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;response_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Response text:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;invoke_anthropic_claude&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Explanation
&lt;/h4&gt;

&lt;p&gt;Let's break down the provided code to understand its functionality and how it facilitates multi-turn conversations with Anthropic Claude.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Chat History Formatting:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;multi_turn_bedrock_request&lt;/code&gt; function takes a list of chat history entries.&lt;/li&gt;
&lt;li&gt;Each entry is a dictionary with either a "Human" key (representing the user) or an "AI" key (representing Claude's response).&lt;/li&gt;
&lt;li&gt;The function maps "Human" to the role "user" and "AI" to the role "assistant," formatting the content accordingly.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Request Payload Construction:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The formatted chat history is combined with a new user prompt.&lt;/li&gt;
&lt;li&gt;Additional metadata such as the Anthropic version, system prompt, maximum tokens, and temperature settings are included.&lt;/li&gt;
&lt;li&gt;This structured payload ensures that Claude understands the context and generates appropriate responses.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;AWS Bedrock Invocation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Bedrock Runtime client is created using Boto3, AWS's SDK for Python.&lt;/li&gt;
&lt;li&gt;The request payload is sent to the specified model ID.&lt;/li&gt;
&lt;li&gt;The response is decoded, and the generated text is extracted and printed.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Complete Code
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;botocore.exceptions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ClientError&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;multi_turn_bedrock_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Convert chat history to the required format for Bedrock Anthropic Claude.

    Parameters:
    chat_history (list): A list of chat history where &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Human&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; corresponds to &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; and &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AI&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; to &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.

    Returns:
    messages: The formatted request body for Bedrock Anthropic Claude.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="c1"&gt;# Convert chat history into the expected format
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}]&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}]&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Multi turn chat history formatted:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;

&lt;span class="c1"&gt;# Example chat history
&lt;/span&gt;&lt;span class="n"&gt;chat_history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; 
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;If I start the trip at 8:00 AM and stop for lunch at noon, what time will I reach Berlin?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;If you stop for lunch at noon for 1 hour and then continue driving, you would reach Berlin around 7:30 PM, assuming you maintain a speed of 100 kilometers per hour.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What if I encounter traffic that delays me by 30 minutes?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;If you&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re delayed by 30 minutes due to traffic, you&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ll reach Berlin by 8:00 PM.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Interesting! Now, let&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s"&gt;s switch topics. What&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s"&gt;s the largest planet in the Solar System?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The largest planet in the Solar System is Jupiter.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How many moons does it have?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; 
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Jupiter has 95 known moons, with the four largest being the Galilean moons: Io, Europa, Ganymede, and Callisto.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;If I were traveling at the speed of light, how long would it take to reach there from Earth?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; 
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;At the speed of light, it would take approximately 43.3 minutes to reach Jupiter from Earth when they are at their closest approach.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Convert chat history to the required format
&lt;/span&gt;&lt;span class="n"&gt;formatted_messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;multi_turn_bedrock_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define the prompt for the model
&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What if I reduce my speed by 50%?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# followup question related to chat history
&lt;/span&gt;
&lt;span class="c1"&gt;# Define the request payload
&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-2023-05-31&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;This is your system prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;formatted_messages&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Complete sample code to invoke Anthropic Claude using Python on AWS
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invoke_anthropic_claude&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Create a Bedrock Runtime client in the AWS Region of your choice.
&lt;/span&gt;    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-runtime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Set the model ID, e.g., Claude 3 Haiku.
&lt;/span&gt;    &lt;span class="n"&gt;model_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic.claude-3-haiku-20240307-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Print the request body before sending
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Request body:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Invoke the model with the request.
&lt;/span&gt;        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;accept&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;contentType&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ClientError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ERROR: Can&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t invoke &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;. Reason: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Decode the response body.
&lt;/span&gt;    &lt;span class="n"&gt;model_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract and print the response text.
&lt;/span&gt;    &lt;span class="n"&gt;response_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Response text:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Invoke the function
&lt;/span&gt;&lt;span class="nf"&gt;invoke_anthropic_claude&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Expected AI Response
&lt;/h4&gt;

&lt;p&gt;When the user adds the follow-up prompt "What if I reduce my speed by 50%?", Claude uses the established context to generate a relevant and coherent answer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"If you reduce your speed to 50% of the speed of light, it would take about 86.6 minutes to travel from Earth to Jupiter at their closest distance."&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This response demonstrates Claude's ability to retain conversational context: it correctly interprets "my speed" as referring to the most recent topic (light-speed travel to Jupiter) and doubles the earlier 43.3-minute estimate accordingly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Error Handling and Monitoring
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvaz1aqvp8lg58xbfwpt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvaz1aqvp8lg58xbfwpt.png" alt="A timeless workspace : Generated by DALL.E-3" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Regardless of which model you use, it’s important to handle errors gracefully. Both Llama 3 and Claude can encounter issues related to token limits or malformed inputs, which can lead to incomplete or incorrect responses. Implement robust error-handling mechanisms in your code to catch these issues early and retry requests when necessary.&lt;/p&gt;
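
&lt;p&gt;As a minimal sketch of that idea (illustrative, not production-ready), a small wrapper can retry &lt;code&gt;invoke_model&lt;/code&gt; on throttling and give up on errors that will not succeed again; the attempt count and backoff delays below are arbitrary assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import time

from botocore.exceptions import ClientError


def invoke_with_retries(client, model_id, body, max_attempts=3):
    # Retry transient failures with exponential backoff (2s, 4s, ...).
    for attempt in range(1, max_attempts + 1):
        try:
            response = client.invoke_model(modelId=model_id, body=json.dumps(body))
            return json.loads(response["body"].read())
        except ClientError as e:
            error_code = e.response["Error"]["Code"]
            # Throttling is transient and worth retrying; malformed-input
            # errors will fail again, so re-raise those immediately.
            if error_code == "ThrottlingException" and attempt &amp;lt; max_attempts:
                time.sleep(2 ** attempt)
                continue
            raise
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;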

&lt;p&gt;Additionally, monitor the performance of the model over time. If you notice that responses are becoming less coherent as the conversation grows, you may need to adjust how much history is being passed or experiment with different system prompts.&lt;/p&gt;
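
&lt;p&gt;One simple way to cap the history is to keep only the most recent turns before formatting. The sketch below assumes the alternating &lt;code&gt;Human&lt;/code&gt;/&lt;code&gt;AI&lt;/code&gt; pair format of the &lt;code&gt;chat_history&lt;/code&gt; list used earlier, and the &lt;code&gt;max_turns&lt;/code&gt; default is an arbitrary starting point to tune against your token budget:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def trim_chat_history(chat_history, max_turns=5):
    # Each turn is one Human message plus one AI reply, i.e. two entries.
    return chat_history[-(max_turns * 2):]


# Format only the trimmed history instead of the full conversation
formatted_messages = multi_turn_bedrock_request(trim_chat_history(chat_history))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;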

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Effectively managing chat history in AWS Bedrock models, such as Llama 3 and Anthropic Claude, is pivotal for developing sophisticated, context-aware AI systems. Llama 3 employs a straightforward concatenated prompt structure, whereas Claude’s message array format offers enhanced flexibility for handling multi-turn conversations. Understanding these distinctions and adhering to best practices—such as limiting conversation turns, utilizing clear system prompts, and monitoring token usage—enables the creation of more intelligent and efficient conversational applications.&lt;/p&gt;

&lt;p&gt;For developers working on customer support chatbots, travel assistants, or any AI-driven applications, mastering chat history management is essential for optimizing both user experience and AI performance. Leveraging the Messages API in Claude v3 facilitates the maintenance of coherent dialogues that remember and build upon previous interactions, thereby ensuring seamless and engaging user interactions.&lt;/p&gt;

&lt;p&gt;The provided sample code and enhanced conversation examples serve as a robust foundation for implementing these capabilities. As conversational AI continues to evolve, mastering these techniques will be crucial for developing applications that not only respond accurately but also understand and retain the nuances of human interactions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/what-is-bedrock.html" rel="noopener noreferrer"&gt;Amazon Bedrock Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.anthropic.com/index/claude" rel="noopener noreferrer"&gt;Anthropic Claude Overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/index.html" rel="noopener noreferrer"&gt;Boto3 AWS SDK for Python&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/" rel="noopener noreferrer"&gt;Model Cards &amp;amp; Prompt formats - Llama 3.1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llama</category>
      <category>anthropic</category>
      <category>ai</category>
      <category>aws</category>
    </item>
    <item>
      <title>Crafting Structured {JSON} Responses: Ensuring Consistent Output from any LLM 🦙🤖</title>
      <dc:creator>Rishab Dugar</dc:creator>
      <pubDate>Sun, 15 Sep 2024 07:27:22 +0000</pubDate>
      <link>https://dev.to/rishabdugar/crafting-structured-json-responses-ensuring-consistent-output-from-any-llm-l9h</link>
      <guid>https://dev.to/rishabdugar/crafting-structured-json-responses-ensuring-consistent-output-from-any-llm-l9h</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ayyfzcgx9ztx2fqk24d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ayyfzcgx9ztx2fqk24d.png" alt="Generated by DALL.E 3" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Large Language Models (LLMs)&lt;/strong&gt; are revolutionizing how we interact with data, but getting these models to generate well-formatted &amp;amp; usable &lt;code&gt;JSON&lt;/code&gt; responses consistently can feel like herding digital cats. You ask for structured data and get a jumbled mess interspersed with friendly commentary. Frustrating, right?&lt;br&gt;
A reliable &lt;code&gt;JSON&lt;/code&gt; output is crucial, whether you're categorizing customer feedback, extracting structured data from unstructured text, or automating data pipelines. This article aims to provide a comprehensive, generalized approach to ensure you get perfectly formatted JSON from any LLM, every time.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7y5kouhpotyc1cbmu38h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7y5kouhpotyc1cbmu38h.png" alt="Imagined with Meta AI" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
LLMs are trained on massive text datasets, making them adept at generating human-like text. However, this strength becomes a weakness when you need precise, structured output such as JSON or a Python dictionary.&lt;br&gt;
Common issues include the following (a short failing example appears just after the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inconsistent Formatting:&lt;/strong&gt; Random spaces, line breaks, and inconsistent quoting can break JSON parsers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extraneous Text:&lt;/strong&gt; LLMs often add conversational fluff before or after the JSON, making extraction difficult.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucinations:&lt;/strong&gt; LLMs might invent data points or misinterpret instructions, leading to invalid or inaccurate JSON.&lt;/li&gt;
&lt;/ul&gt;
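
&lt;p&gt;For instance, a reply like the following fails &lt;code&gt;json.loads()&lt;/code&gt; outright because of the surrounding chatter (the &lt;code&gt;raw_output&lt;/code&gt; string is invented for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

# Perfectly valid JSON buried inside conversational fluff
raw_output = 'Sure! Here is your data:\n{"sentiment": "positive"}\nLet me know if you need more.'

try:
    json.loads(raw_output)
except json.JSONDecodeError as e:
    # The prose before the object breaks the parser at the very first character
    print(f"Parsing failed: {e}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;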

&lt;p&gt;These issues can disrupt downstream processes and lead to significant inefficiencies. Let's explore some proven techniques to overcome these challenges.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Solution: A Multi-Layered Approach
&lt;/h2&gt;
&lt;h3&gt;
  
  
  &lt;em&gt;1. Guiding the LLM with Clear Instructions&lt;/em&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Explicitly Request JSON:&lt;/strong&gt; Clearly state that you expect the output in JSON format. Explicitly stating the intended use of the &lt;strong&gt;JSON&lt;/strong&gt; output in the prompt can significantly improve its validity. Giving explicit instructions to provide a structured response in "system_prompt" can also prove helpful.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;json_prompt&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"""Ensure the output is valid JSON as it will be parsed 
                 using `json.loads()` in Python. 
                 It should be in the schema: 
                &amp;lt;output&amp;gt;
                {
                "&lt;/span&gt;&lt;span class="err"&gt;cars&lt;/span&gt;&lt;span class="s2"&gt;": [
                    {
                    "&lt;/span&gt;&lt;span class="err"&gt;model&lt;/span&gt;&lt;span class="s2"&gt;": "&lt;/span&gt;&lt;span class="err"&gt;&amp;lt;model_name&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;",
                    "&lt;/span&gt;&lt;span class="err"&gt;color&lt;/span&gt;&lt;span class="s2"&gt;": "&lt;/span&gt;&lt;span class="err"&gt;&amp;lt;color&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;"
                    },
                    {
                    "&lt;/span&gt;&lt;span class="err"&gt;model&lt;/span&gt;&lt;span class="s2"&gt;": "&lt;/span&gt;&lt;span class="err"&gt;&amp;lt;model_name&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;",
                    "&lt;/span&gt;&lt;span class="err"&gt;color&lt;/span&gt;&lt;span class="s2"&gt;": "&lt;/span&gt;&lt;span class="err"&gt;&amp;lt;color&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;"
                    },
                    {
                    "&lt;/span&gt;&lt;span class="err"&gt;model&lt;/span&gt;&lt;span class="s2"&gt;": "&lt;/span&gt;&lt;span class="err"&gt;&amp;lt;model_name&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;",
                    "&lt;/span&gt;&lt;span class="err"&gt;color&lt;/span&gt;&lt;span class="s2"&gt;": "&lt;/span&gt;&lt;span class="err"&gt;&amp;lt;color&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;"
                    },
                    {
                    "&lt;/span&gt;&lt;span class="err"&gt;model&lt;/span&gt;&lt;span class="s2"&gt;": "&lt;/span&gt;&lt;span class="err"&gt;&amp;lt;model_name&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;",
                    "&lt;/span&gt;&lt;span class="err"&gt;color&lt;/span&gt;&lt;span class="s2"&gt;": "&lt;/span&gt;&lt;span class="err"&gt;&amp;lt;color&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;"
                    },
                    {
                    "&lt;/span&gt;&lt;span class="err"&gt;model&lt;/span&gt;&lt;span class="s2"&gt;": "&lt;/span&gt;&lt;span class="err"&gt;&amp;lt;model_name&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;",
                    "&lt;/span&gt;&lt;span class="err"&gt;color&lt;/span&gt;&lt;span class="s2"&gt;": "&lt;/span&gt;&lt;span class="err"&gt;&amp;lt;color&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;"
                    }
                ]
                }
                &amp;lt;/output&amp;gt;
                """&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;#Defining&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;system&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;prompt&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;system_prompt&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You are an AI language model that provides structured JSON outputs."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Provide a JSON Schema:&lt;/strong&gt; Define the exact structure of the desired JSON, including keys and data types.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Examples:&lt;/strong&gt; Show the LLM examples of correctly formatted JSON output for your specific use case.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As suggested in the &lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/prefill-claudes-response" rel="noopener noreferrer"&gt;Anthropic documentation&lt;/a&gt;, another effective method is to guide the LLM by prefilling the assistant's response with the beginning of the JSON structure. This technique leverages the model's ability to continue from a given starting point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;botocore.exceptions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ClientError&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# AWS Bedrock setup
&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWS_DEFAULT_REGION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;aws_access_key_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWS_ACCESS_KEY_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;aws_secret_access_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWS_SECRET_ACCESS_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;bedrock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-runtime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create a Bedrock Runtime client in the AWS Region of your choice.
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-runtime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Set the model ID for Claude.
&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic.claude-3-haiku-20240307-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Define the JSON schema and example prefill response with stop sequences.
&lt;/span&gt;&lt;span class="n"&gt;output_start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;output&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cars&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="n"&gt;closing_bracket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/output&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;json_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Ensure the output is valid JSON as it will be parsed 
                 using `json.loads()` in Python. 
                 It should be in the schema: 
                &amp;lt;output&amp;gt;
                {
                &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cars&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: [
                    {
                    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;model_name1&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
                    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;color&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;color1&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
                    },
                    {
                    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;model_name2&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
                    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;color&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;color2&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
                    },
                    {
                    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;model_name3&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
                    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;color&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;color3&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
                    },
                    {
                    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;model_name4&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
                    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;color&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;color4&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
                    },
                    {
                    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;model_name5&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
                    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;color&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;color5&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
                    }
                ]
                }
                &amp;lt;/output&amp;gt;
                &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="c1"&gt;# Define the prompt for the model.
&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Provide an example of 5 cars with their color and models in JSON format enclosed in &amp;lt;output&amp;gt;&amp;lt;/output&amp;gt; XML tags.
            &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json_prompt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="c1"&gt;# Prefilled part of the response.
&lt;/span&gt;&lt;span class="n"&gt;prefilled_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;output_start&lt;/span&gt;

&lt;span class="c1"&gt;# Define the system prompt.
&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are an AI language model that provides structured JSON outputs.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Format the request payload using the model's native structure.
&lt;/span&gt;&lt;span class="n"&gt;native_request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-2023-05-31&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stop_sequences&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Human:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;closing_bracket&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;system&amp;gt;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/system&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prefilled_response&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Convert the native request to JSON.
&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;native_request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Invoke the model with the request.
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Decode the response body.
&lt;/span&gt;    &lt;span class="n"&gt;model_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract and print the response text.
&lt;/span&gt;    &lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;final_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prefilled_response&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;closing_bracket&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;final_result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ClientError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ERROR: Can&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t invoke &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;. Reason: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;&amp;lt;output&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"cars"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Toyota Corolla"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Silver"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Honda Civic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Blue"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Ford Mustang"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Red"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Chevrolet Camaro"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Black"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Nissan Altima"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"White"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;&amp;lt;/output&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The salient features of this method are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prefilling the Response:&lt;/strong&gt; "Put words in the LLM's mouth" by starting the assistant's response with the opening bracket &lt;code&gt;{&lt;/code&gt; or a longer opening sequence, as used above with &lt;code&gt;&amp;lt;output&amp;gt;\n{\n"cars":&lt;/code&gt;. This encourages the model to follow the expected format.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strategic Stop Sequences:&lt;/strong&gt; Define stop sequences (like &lt;code&gt;}&lt;/code&gt; or specific keywords, for example &lt;code&gt;]\n}\n&amp;lt;/output&amp;gt;&lt;/code&gt;) to prevent the LLM from adding extraneous text after the JSON.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Leveraging Tags for Complex Outputs:&lt;/strong&gt; For multiple JSON objects, ask for the output to be enclosed within unique tags (e.g., &lt;code&gt;&amp;lt;output&amp;gt;...&amp;lt;/output&amp;gt;&lt;/code&gt; XML tags). This allows for easy extraction using regular expressions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Extracting the JSON Response Between XML Tags
&lt;/h2&gt;

&lt;p&gt;When working with APIs or systems that return responses wrapped in XML tags, it becomes crucial to extract and utilize the JSON data embedded within those tags. Below, we'll explore methods to extract JSON data from XML tags both with and without the use of regular expressions (regex), followed by saving the extracted data to a JSON file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Regular Expressions (Regex)
&lt;/h2&gt;

&lt;p&gt;Regex can be a powerful tool for pattern matching and extraction. In this case, we can use regex to locate the JSON content within the specified XML tags.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_json_with_regex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;pattern&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;output&amp;gt;(.*?)&amp;lt;/output&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="c1"&gt;# Search for the pattern &amp;lt;output&amp;gt;...&amp;lt;/output&amp;gt;
&lt;/span&gt;    &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DOTALL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Extract the content between the tags
&lt;/span&gt;        &lt;span class="n"&gt;json_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Parse the string to a JSON object
&lt;/span&gt;            &lt;span class="n"&gt;json_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json_data&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Return None if JSON parsing fails
&lt;/span&gt;            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="c1"&gt;# Return None if no match is found
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this function, &lt;code&gt;re.search()&lt;/code&gt; is used to find the first occurrence of the pattern &lt;code&gt;&amp;lt;output&amp;gt;...&amp;lt;/output&amp;gt;&lt;/code&gt; in the response. If found, it extracts the content between these tags and attempts to parse it as &lt;code&gt;JSON&lt;/code&gt;. If parsing fails, it returns &lt;code&gt;None&lt;/code&gt;.&lt;/p&gt;
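
&lt;p&gt;For example, calling it on a hypothetical model reply (the &lt;code&gt;sample_response&lt;/code&gt; string below is invented for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A hypothetical model reply: JSON wrapped in the tags, plus extra chatter
sample_response = (
    'Sure, here are the cars: &amp;lt;output&amp;gt;{"cars": '
    '[{"model": "Toyota Corolla", "color": "Silver"}]}&amp;lt;/output&amp;gt; Hope this helps!'
)

data = extract_json_with_regex(sample_response)
if data:
    print(data["cars"][0]["model"])  # Toyota Corolla
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;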

&lt;h2&gt;
  
  
  Without Using Regular Expressions
&lt;/h2&gt;

&lt;p&gt;For scenarios where you prefer not to use regex, a more manual approach can be employed to achieve the same goal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_json_without_regex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;start_tag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;output&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;end_tag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/output&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="c1"&gt;# Find the start and end indices of the tags
&lt;/span&gt;    &lt;span class="n"&gt;start_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start_tag&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;end_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;end_tag&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;start_index&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;end_index&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Adjust start index to get the content after the start tag
&lt;/span&gt;        &lt;span class="n"&gt;start_index&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start_tag&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Extract the content between the tags
&lt;/span&gt;        &lt;span class="n"&gt;json_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start_index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;end_index&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Parse the string to a JSON object
&lt;/span&gt;            &lt;span class="n"&gt;json_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json_data&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Return None if JSON parsing fails
&lt;/span&gt;            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="c1"&gt;# Return None if tags are not found
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function locates the starting and ending positions of the &lt;code&gt;&amp;lt;output&amp;gt;...&amp;lt;/output&amp;gt;&lt;/code&gt; tags manually, extracts the content between them, and attempts to parse it as JSON. Like the regex approach, it returns &lt;code&gt;None&lt;/code&gt; if parsing fails or the tags are not found.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Saving Extracted JSON to a File&lt;/strong&gt;&lt;br&gt;
After extracting the JSON data, the next step is to save it to a file for further processing or record-keeping. The function below handles this task.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;save_json_to_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;json_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Save the JSON data to the specified file with indentation for readability
&lt;/span&gt;        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JSON data saved to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This utility function opens a file in write mode and uses &lt;code&gt;json.dump()&lt;/code&gt; to write the JSON data to it, formatting the output with an indentation of 4 spaces for readability.&lt;/p&gt;
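
&lt;p&gt;As a quick usage sketch (the &lt;code&gt;extracted&lt;/code&gt; dictionary and file name below are placeholders standing in for whatever the extraction step actually returned):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical usage: "extracted" stands in for the dictionary returned by
# the regex- or tag-based extraction functions shown earlier.
extracted = {"cars": [{"model": "Toyota Corolla", "color": "Silver"}]}

if extracted is not None:
    save_json_to_file(extracted, file_name="cars_demo.json")
else:
    print("No valid JSON found in the response.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;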

&lt;p&gt;&lt;strong&gt;Final JSON result (output.json):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"cars"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Toyota Corolla"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Silver"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Honda Civic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Blue"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Ford Mustang"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Red"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Chevrolet Camaro"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Black"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Nissan Altima"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"White"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;em&gt;2. Validating and Repairing JSON Response&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Despite the earlier techniques, minor syntax errors can occasionally slip through and disrupt the JSON structure. Two simple methods can address them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Requesting the LLM to Correct the JSON:&lt;/strong&gt; Feed the malformed &lt;strong&gt;JSON&lt;/strong&gt; back to the LLM and prompt it to correct the errors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Utilizing JSON Repair Tools:&lt;/strong&gt; Tools like &lt;a href="https://github.com/mangiucugna/json_repair" rel="noopener noreferrer"&gt;&lt;code&gt;json_repair&lt;/code&gt;&lt;/a&gt; or &lt;a href="https://github.com/half-pie/half-json" rel="noopener noreferrer"&gt;&lt;code&gt;half-json&lt;/code&gt;&lt;/a&gt; can correct these errors quickly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The second method is generally cheaper, faster, and more reliable for straightforward cleanup tasks. The first can be more effective for complex issues, albeit at the cost of additional time and an extra LLM call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example (using &lt;a href="https://github.com/mangiucugna/json_repair" rel="noopener noreferrer"&gt;json-repair&lt;/a&gt;):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install json-repair&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;json_repair&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;repair_json&lt;/span&gt;

&lt;span class="n"&gt;cleaned_final_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;repair_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;final_result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;You can also use this library as a drop-in replacement for &lt;code&gt;json.loads()&lt;/code&gt;:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json_repair&lt;/span&gt;

&lt;span class="n"&gt;decoded_object&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json_repair&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
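

&lt;p&gt;To see the repair in action on a deliberately broken snippet (a small sketch; the exact repaired values can vary with the library version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

from json_repair import repair_json

# Unquoted value and a missing closing quote -- json.loads() would fail here.
broken = '{"cars": [{"model": Toyota Corolla, "color": "Silver}]}'

fixed_str = repair_json(broken)  # returns a syntactically valid JSON string
data = json.loads(fixed_str)     # now parses without raising
print(data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;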



&lt;p&gt;&lt;strong&gt;Example (asking the LLM to fix broken JSON):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;botocore.exceptions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ClientError&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="c1"&gt;# Load environment variables from a .env file
&lt;/span&gt;&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# AWS Bedrock setup with credentials and region from environment variables
&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWS_DEFAULT_REGION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;aws_access_key_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWS_ACCESS_KEY_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;aws_secret_access_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWS_SECRET_ACCESS_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create a Bedrock Runtime client using the session
&lt;/span&gt;&lt;span class="n"&gt;bedrock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-runtime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create a Bedrock Runtime client in the AWS Region of your choice (hardcoded to 'us-east-1')
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-runtime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Set the model ID for Claude
&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic.claude-3-haiku-20240307-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Define the prefill response with stop sequences.
&lt;/span&gt;&lt;span class="n"&gt;output_start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;closing_bracket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;#Example of Broken/Invalid JSON 
&lt;/span&gt;&lt;span class="n"&gt;json_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 
&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
{
    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cars&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: [
        {
            &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: Toyota Corolla, # Missing quotes around the value
            &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;color&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Silver&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
        },
        {
            &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Honda Civic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
            &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;color&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Blue          # Missing closing quote
        },
        {
            &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ford Mustang&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
            &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;color&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Red&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
        },
        {
            model: Chevrolet Camaro, # Missing quotes around the key and value
            &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;color&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Black&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;         # Mixed quotes, opening with &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; and closing with &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
        ,                            # Missing closing brace for the object
        {
            &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Nissan Altima&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
            &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;color&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;White          # Missing closing quote and closing brace for the object
        }
    ]                           
}    
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="c1"&gt;# Define the prompt for the model
&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fix the JSON below:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json_prompt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="c1"&gt;# Prefilled part of the response
&lt;/span&gt;&lt;span class="n"&gt;prefilled_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;output_start&lt;/span&gt;

&lt;span class="c1"&gt;# Generic System prompt for JSON Repairing via LLM.
&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;

### Instruction

Your task is to act as an expert JSON fixer and repairer. You are responsible for correcting any broken JSON and ensuring there are no syntax errors. The resulting JSON should be validated and easily parsed using `json.loads()` in Python.

### Context

JSON is built on two primary structures:
1. A collection of name/value pairs, realized in various languages as an object, record, struct, dictionary, hash table, keyed list, or associative array.
2. An ordered list of values, realized in most languages as an array, vector, list, or sequence.

These structures are supported by virtually all modern programming languages, making JSON a widely used data interchange format.

In JSON, the structures take the following forms:
- An **object** is an unordered set of name/value pairs. An object begins with a `{` (left brace) and ends with a `}` (right brace). Each name is followed by a `:` (colon) and the name/value pairs are separated by `,` (comma).
- An **array** is an ordered collection of values. An array begins with a `[` (left bracket) and ends with a `]` (right bracket). Values are separated by `,` (comma).

### Requirements
1. Repair only the JSON structure without changing or modifying any data or values of the keys.
2. Ensure that the data is accurately represented and properly formatted within the JSON structure.
3. The resulting JSON should be validated and able to be parsed using `json.loads()` in Python.

### Example

#### Broken JSON
{
    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;John Doe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: 30,
    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;isStudent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: false
    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;courses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: [&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Math&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Science&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]
    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;address&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: {
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;street&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;123 Main St&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Anytown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zipcode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;12345&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
    }

#### Fixed JSON

{
    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;John Doe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: 30,
    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;isStudent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: false,
    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;courses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: [&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Math&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Science&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;],
    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;address&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: {
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;street&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;123 Main St&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Anytown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zipcode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;12345&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
    }
}

### Notes
- Pay close attention to missing commas, unmatched braces or brackets, and any other structural issues.
- Maintain the integrity of the data without making assumptions or altering the content.
- Ensure the output is clean, precise, and ready for parsing in Python.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="c1"&gt;# Format the request payload using the model's native structure
&lt;/span&gt;&lt;span class="n"&gt;native_request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-2023-05-31&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stop_sequences&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Human:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;closing_bracket&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;system&amp;gt;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/system&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prefilled_response&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Convert the native request to JSON
&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;native_request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Invoke the model with the request
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Decode the response body
&lt;/span&gt;    &lt;span class="n"&gt;model_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract and print the response text
&lt;/span&gt;    &lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;final_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prefilled_response&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;closing_bracket&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;final_result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ClientError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ERROR: Can&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t invoke &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;. Reason: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output (as JSON):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"cars"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Toyota Corolla"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Silver"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Honda Civic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Blue"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Ford Mustang"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Red"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Chevrolet Camaro"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Black"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Nissan Altima"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"White"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
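

&lt;p&gt;Since the system prompt promises output that &lt;code&gt;json.loads()&lt;/code&gt; can parse, it is worth verifying that promise before using the result. A minimal sketch, reusing &lt;code&gt;final_result&lt;/code&gt; from the script above and falling back to &lt;code&gt;json_repair&lt;/code&gt; if parsing still fails:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

import json_repair

try:
    data = json.loads(final_result)
except json.JSONDecodeError:
    # Fall back to mechanical repair if the model's fix is still malformed.
    data = json_repair.loads(final_result)

print(data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;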



&lt;h2&gt;
  
  
  Balanced Perspective
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fju5cglb3fx6qnq34gxd6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fju5cglb3fx6qnq34gxd6.jpg" alt="Generated by DALL.E 3 — Ancient machine, fuels a futuristic dream" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While these techniques can significantly improve the consistency of &lt;strong&gt;JSON&lt;/strong&gt; output from LLMs, they are not foolproof. Potential challenges include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increased complexity in prompt design&lt;/li&gt;
&lt;li&gt;Additional computational overhead for post-processing&lt;/li&gt;
&lt;li&gt;Dependency on external libraries for validation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Moreover, ethical considerations such as data privacy and model biases should always be taken into account when deploying LLMs in production environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Actionable Insights
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with a Clear JSON Template:&lt;/strong&gt; Define the JSON structure up front and use it as a guide for the &lt;strong&gt;LLM&lt;/strong&gt; with few-shot prompting examples (a small sketch follows this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Leverage Post-Processing Tools:&lt;/strong&gt; Use tools like &lt;a href="https://github.com/mangiucugna/json_repair" rel="noopener noreferrer"&gt;&lt;code&gt;json_repair&lt;/code&gt;&lt;/a&gt; to correct minor syntax errors in the &lt;strong&gt;JSON&lt;/strong&gt; output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterate and Improve:&lt;/strong&gt; Continuously refine your prompts and validation rules based on the output and feedback.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By following these steps, we can ensure that our LLM consistently generates well-formatted &lt;strong&gt;JSON&lt;/strong&gt;, making our AI-driven applications more reliable and efficient.&lt;/p&gt;
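
&lt;p&gt;For the first insight, a minimal few-shot template prompt might look like this (the schema and wording here are illustrative, not taken from a specific library):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative sketch: give the model an explicit template to fill in,
# reusing the &amp;lt;output&amp;gt; tag convention from earlier in the article.
json_template = """
{
    "cars": [
        {"model": "&amp;lt;string&amp;gt;", "color": "&amp;lt;string&amp;gt;"}
    ]
}
"""

prompt = f"""List three popular cars and their colors.
Respond with JSON only, wrapped in &amp;lt;output&amp;gt;&amp;lt;/output&amp;gt; tags and matching this template:
{json_template}"""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;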

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Generating perfectly formatted &lt;strong&gt;JSON&lt;/strong&gt; from LLMs is a common yet challenging task. By guiding the JSON syntax, communicating its intended usage, and using repair tools like &lt;a href="https://github.com/mangiucugna/json_repair" rel="noopener noreferrer"&gt;json-repair&lt;/a&gt;, we can significantly improve the consistency and reliability of the output. Combining clear instructions, strategic prompting, and robust validation transforms our LLM interactions from a gamble into a reliable pipeline for structured data.&lt;br&gt;
That's all for the day, folks. Stay informed, iterate, and refine your approach to master the art of JSON generation from any LLM.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>llama</category>
      <category>python</category>
      <category>genai</category>
    </item>
  </channel>
</rss>
