Richard Sakaguchi

Posted on • Originally published at blog.sakaguchi.ia.br

I Built a Brazilian Portuguese LLM from Scratch - Here's What I Learned

The Problem

Most AI models are trained on English data. Like, 90%+ English. Portuguese? Less than 2%. Brazilian Portuguese? Even rarer.

This creates real problems:

Customer: "Tô de boa, só quero dar uma olhada"
(Translation: "I'm cool, just browsing")

GPT: "I don't understand. Could you rephrase?"

Or worse:

Customer: "Vocês aceitam PIX?"
(PIX = Brazil's instant payment system, used by 150M+ people)

GPT: "What is PIX?"

The Solution: Yoshii IA

I fine-tuned Mistral-7B on real Brazilian customer service conversations to create a model that actually understands:

  • 🇧🇷 Brazilian slang and expressions
  • 💳 Local context (PIX, CPF, CEP)
  • 🗣️ Natural conversational Portuguese
  • 🤝 Customer service best practices

The Technical Journey

1. Data Collection

I built a dataset of 10,000+ real customer service conversations in Brazilian Portuguese:

  • WhatsApp business chats
  • Support tickets
  • E-commerce interactions
  • Various industries (healthcare, retail, restaurants)

Dataset is open source: HuggingFace Dataset
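Before training, each conversation has to be flattened into a single text string. The post doesn't show the exact dataset schema, so here's a minimal sketch assuming a hypothetical `turns` list of role/text pairs; the real field names on the HuggingFace dataset may differ:

```python
# Hypothetical record schema -- the actual dataset fields may differ.
def format_example(record: dict) -> str:
    """Flatten one customer-service conversation into a training string."""
    lines = []
    for turn in record["turns"]:
        speaker = "Cliente" if turn["role"] == "customer" else "Atendente"
        lines.append(f"{speaker}: {turn['text']}")
    return "\n".join(lines)

sample = {
    "turns": [
        {"role": "customer", "text": "Vocês aceitam PIX?"},
        {"role": "agent", "text": "Aceitamos sim! Pode pagar por PIX ou cartão. 😊"},
    ]
}
print(format_example(sample))
# Cliente: Vocês aceitam PIX?
# Atendente: Aceitamos sim! Pode pagar por PIX ou cartão. 😊
```

Keeping the `Cliente:`/`Atendente:` markers in the training text is what lets the same prompt format drive inference later.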

2. Training Setup

Used QLoRA (Quantized LoRA) to fine-tune on consumer hardware:

# 4-bit quantization (QLoRA) + LoRA adapter config
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store base weights in 4-bit NF4
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
    bnb_4bit_quant_type="nf4",
)

lora_config = LoraConfig(
    r=16,                                   # adapter rank
    lora_alpha=32,                          # scaling factor (alpha / r = 2)
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

Hardware: Single RTX 3090 (24GB VRAM)
Training time: ~4 hours
VRAM usage: ~6GB with 4-bit quantization
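The ~6GB figure checks out with back-of-envelope math. Mistral-7B has roughly 7.24B parameters; in bf16 the weights alone wouldn't fit training headroom on a 24GB card, but at 4 bits they shrink to a few GB, leaving room for LoRA adapters, optimizer states, and activations (the parameter count here is approximate):

```python
# Back-of-envelope VRAM math for QLoRA on Mistral-7B (approximate sizes)
params = 7.24e9  # Mistral-7B parameter count, roughly

bf16_gb = params * 2 / 2**30    # 2 bytes per weight in bf16
nf4_gb = params * 0.5 / 2**30   # 0.5 bytes per weight in 4-bit NF4

print(f"bf16 weights: ~{bf16_gb:.1f} GB")  # ~13.5 GB
print(f"nf4 weights:  ~{nf4_gb:.1f} GB")   # ~3.4 GB
```

The gap between ~3.4GB of quantized weights and the observed ~6GB is the LoRA adapters, their optimizer states, quantization constants, and activation memory.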

3. Results

Before (GPT-4):

User: "E aí, tudo certo?"
(Translation: "Hey, everything good?")

Bot: "Olá! Como posso ajudá-lo hoje?"
(Translation: "Hello! How may I assist you today?" - correct, but stiff and formal)

After (Yoshii IA):

User: "E aí, tudo certo?"
Bot: "E aí! Tudo certinho sim! E você? 😊 Em que posso ajudar?"
(Translation: "Hey! All good here! And you? 😊 How can I help?")

Open Source Everything

🤗 Model: yoshii-ai/Yoshii-7B-BR
📊 Dataset: brazilian-customer-service-conversations
🌐 Live Demo: yoshii.sakaguchi.ia.br
💬 WhatsApp Bot: Production-ready, handling real customers

What's Next

  • 📢 Voice support (STT + TTS)
  • 📊 Sentiment analysis for Portuguese
  • 🔮 Predictive customer support
  • 🤖 Multi-agent orchestration

Lessons Learned

  1. Data quality > quantity - 10K well-curated samples beat 100K messy ones
  2. Cultural context matters - PIX, CPF, CEP aren't just words, they're institutions
  3. QLoRA is magic - Fine-tuning 7B models on consumer GPUs? Yes please
  4. Portuguese isn't one language - Brazilian Portuguese ≠ European Portuguese
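The post doesn't show the curation pipeline behind lesson 1, but one cheap step that pays off in noisy chat data is accent-insensitive deduplication, since the same question arrives as "Vocês aceitam PIX?" and "voces aceitam pix?". A minimal sketch (not the author's actual pipeline):

```python
import unicodedata

def normalize(text: str) -> str:
    """Lowercase, strip accents, and collapse whitespace for duplicate matching."""
    text = unicodedata.normalize("NFD", text.lower())
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    return " ".join(text.split())

def dedup(samples: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized sample."""
    seen, kept = set(), []
    for s in samples:
        key = normalize(s)
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept

raw = ["Vocês aceitam PIX?", "voces aceitam pix?", "Qual o horário de vocês?"]
print(dedup(raw))  # ['Vocês aceitam PIX?', 'Qual o horário de vocês?']
```

For fuzzier duplicates (typos, reordered words), MinHash or embedding similarity would be the next step up, but exact-after-normalization catches a surprising amount in WhatsApp-style data.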

Try It

# Quick inference
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "yoshii-ai/Yoshii-7B-BR",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("yoshii-ai/Yoshii-7B-BR")

# "Customer: Hi, I want to check on my order\nAgent:"
prompt = "Cliente: Oi, quero saber do meu pedido\nAtendente:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Building for your local market? Open source your work. The community will thank you. 🙌

Questions? Comments? Let me know below!


📍 São Paulo, Brazil
🔗 sakaguchi.ia.br
💬 WhatsApp
