Nitonde Auric Ergeson

Posted on Jun 6

I Fine-Tuned a 3B Model for Text-to-SQL and It Actually Works

#programming #beginners #learning #ai

I have been building a logistics SaaS on the side called Frachtdok. At some point I needed a way to let non-technical users query data without touching SQL. I looked at the usual options: GPT-4 via API, hosted solutions, the whole thing. They all felt like overkill, or too expensive, or both.

So I decided to fine-tune something small myself.

This is the story of how I built Antelope-textTosql, a fine-tuned Phi-2 model (2.7B parameters, which I round up to 3B) that converts plain English into SQL queries. It has 534 downloads so far, which honestly surprised me.

Why Phi-2?

At the time, Phi-2 was one of the more interesting small models around. Microsoft trained it to punch above its weight on reasoning tasks, and text-to-SQL is fundamentally a structured reasoning problem: you're mapping a question onto a schema.

The size mattered too. I wanted something that could run on modest hardware. Phi-2 has 2.7B parameters, and loaded in 4-bit for QLoRA fine-tuning it fits comfortably on a single GPU in Colab.

I considered Llama 2 7B. But 7B felt like more than I needed, and the fine-tuning cost adds up. Phi-2 let me move faster.

The Dataset

I trained on Spider, a well-known text-to-SQL benchmark from Yale. It has around 7,000 training examples, and the benchmark spans 200 databases across many domains: airports, concert halls, HR systems, you name it.

Spider is useful because it forces the model to generalize. You can't just memorize table names; you have to actually learn the mapping from question structure to SQL structure.

I kept the prompt format simple on purpose. Complex prompt engineering can mask model weakness. I wanted to know if the model itself was actually learning.

Training Setup

I used QLoRA: 4-bit quantization plus Low-Rank Adaptation. You load the base model in 4-bit and freeze it, add small trainable adapter layers, and fine-tune just those. It is cheaper than full fine-tuning and good enough for focused tasks like this.
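
To make the adapter idea concrete, here is a toy sketch of the LoRA part in plain Python. It is not my training code: the matrix sizes are tiny, the names are made up for illustration, and the 4-bit quantization step is left out entirely. It only shows that the frozen weight W stays untouched while two small matrices A and B add a scaled low-rank update.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, W, A, B, alpha, r):
    """Return x @ W + (alpha / r) * (x @ A) @ B for row-vector inputs."""
    base = matmul(x, W)                 # frozen path, never updated
    update = matmul(matmul(x, A), B)    # trainable low-rank path
    scale = alpha / r
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]

rng = random.Random(0)
d_in, d_out, r, alpha = 8, 8, 2, 4      # toy sizes; my run used r=16, alpha=32

W = [[rng.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]   # frozen weight
A = [[rng.gauss(0, 0.01) for _ in range(r)] for _ in range(d_in)]    # down-projection, random init
B = [[0.0] * d_out for _ in range(r)]                                # up-projection, zero init
x = [[rng.gauss(0, 1) for _ in range(d_in)]]

base_out = matmul(x, W)
start_out = lora_forward(x, W, A, B, alpha, r)

frozen_params = d_in * d_out
trainable_params = d_in * r + r * d_out
print(frozen_params, trainable_params)
```

Because B starts at zero, the adapted layer produces exactly the base model's output at step zero, and training only ever moves A and B.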

The key numbers:

Base model: microsoft/phi-2
Quantization: 4-bit (bitsandbytes)
LoRA rank: 16
LoRA alpha: 32
Target modules: q_proj, v_proj
Learning rate: 2e-4
Epochs: 3
Hardware: A100 on Google Colab
Training time: ~1.5 hours
Adapter size: ~21MB
Merged model size: ~5.56GB
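
The two sizes at the end of that list can be sanity-checked with a bit of arithmetic. This is a back-of-the-envelope sketch, not something I measured: it assumes Phi-2's public config (hidden size 2560, 32 layers), square q_proj and v_proj projections, float32 adapter weights, and roughly 2.78 billion base parameters stored in float16 after merging.

```python
hidden_size = 2560      # assumed from the public Phi-2 config
num_layers = 32         # assumed from the public Phi-2 config
rank = 16
target_modules = ["q_proj", "v_proj"]

# Each adapted linear layer gets A (hidden x rank) and B (rank x hidden).
params_per_module = rank * (hidden_size + hidden_size)
lora_params = params_per_module * len(target_modules) * num_layers

adapter_mb = lora_params * 4 / 1e6      # 4 bytes per float32 weight (assumption)

base_params = 2.78e9                    # approximate Phi-2 parameter count (assumption)
merged_gb = base_params * 2 / 1e9       # 2 bytes per float16 weight (assumption)

print(f"LoRA parameters: {lora_params:,}")        # 5,242,880
print(f"Adapter size:    {adapter_mb:.2f} MB")    # 20.97 MB
print(f"Merged size:     {merged_gb:.2f} GB")     # 5.56 GB
```

Under those assumptions the numbers land on the ~21MB adapter and ~5.56GB merged model listed above.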

One thing I got wrong the first time: I set the learning rate too high and the model started hallucinating table names confidently. Dropping to 2e-4 and adding a warmup schedule fixed it.
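
For anyone who has not used a warmup schedule before, here is a minimal sketch of the idea. The shape is an assumption for illustration (linear warmup, then cosine decay), and so are the step counts; the only number taken from my run is the 2e-4 peak.

```python
import math

def lr_at_step(step, total_steps, warmup_steps=50, peak_lr=2e-4):
    """Linear warmup from 0 to peak_lr, then cosine decay back toward 0."""
    if step < warmup_steps:
        # Ramp up gently so the first updates do not wreck the adapters.
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total_steps = 1000   # hypothetical run length
for step in (0, 25, 50, 525, 1000):
    print(step, lr_at_step(step, total_steps))
```

The point of the ramp is that the randomly initialized adapter weights see small updates first, instead of taking full-size steps from the very first batch.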

Does It Actually Work?

Here are some real examples:

Question: How many employees are there?
SQL: SELECT COUNT(*) FROM employees

Question: List all customers from Germany
SQL: SELECT * FROM customers WHERE country = 'Germany'

Question: What is the average salary by department?
SQL: SELECT department, AVG(salary) FROM employees GROUP BY department

It handles these well. Where it starts to struggle is complex multi-join queries with ambiguous column names. If you ask something like "show me orders that have more than three items across all categories" without a clear schema, you'll get creative but wrong SQL.

Spider itself has an easy, medium, hard, and extra hard split. My model is solid on easy and medium. Hard and extra hard are hit or miss.

What I'd Change

The prompt only takes a database name, not the actual schema. That's the biggest limitation. If the prompt included the real table and column definitions, the model would handle complex queries much better. Models like RESDSQL do this and the accuracy difference is significant. I cut that corner to keep the interface simple and I probably shouldn't have.
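
Here is roughly what a schema-aware prompt could look like. To be clear, this is a hypothetical format, not the one the published model was trained on: the field labels, the function names, and the example tables are all invented for illustration.

```python
def serialize_schema(schema):
    """Render {table: [columns]} as one 'table(col, col)' line per table."""
    return "\n".join(f"{table}({', '.join(cols)})" for table, cols in schema.items())

def build_prompt(question, db_id, schema=None):
    """Build a prompt with the database name and, optionally, its schema."""
    parts = [f"Database: {db_id}"]
    if schema:
        parts.append("Schema:\n" + serialize_schema(schema))
    parts.append(f"Question: {question}")
    parts.append("SQL:")
    return "\n".join(parts)

# Hypothetical schema for a small shop database.
shop_schema = {
    "orders": ["id", "customer_id", "created_at"],
    "order_items": ["order_id", "product_id", "quantity"],
}

name_only = build_prompt("Which orders have more than three items?", "shop")
with_schema = build_prompt("Which orders have more than three items?", "shop", shop_schema)
print(with_schema)
```

With only the name, the model has to guess that a table like order_items exists. With the schema in the prompt, the join target and column names are right there to copy.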

I also published without running formal benchmarks. The Spider evaluation script gives you an exact match score and I plan to run it soon. Even a modest number is more useful than nothing.
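
In the meantime, a rough sanity check is possible without the official script, since Spider ships its databases as SQLite files. The sketch below is not the Spider evaluation and does not compute exact match. It is a simplified execution check that runs the predicted and the reference query and compares the rows. The table and data are made up.

```python
import sqlite3

def run_query(conn, sql):
    """Return the result rows in a canonical order, or None if the SQL fails."""
    try:
        # Sorting ignores row order, so ORDER BY queries need a stricter check.
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except sqlite3.Error:
        return None

def execution_match(conn, predicted_sql, gold_sql):
    """True if both queries run and return the same set of rows."""
    gold = run_query(conn, gold_sql)
    pred = run_query(conn, predicted_sql)
    return gold is not None and pred == gold

# Hypothetical toy database standing in for a Spider SQLite file.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INTEGER, department TEXT, salary REAL);
INSERT INTO employees VALUES (1, 'ops', 100.0), (2, 'ops', 200.0), (3, 'it', 300.0);
""")

gold = "SELECT department, AVG(salary) FROM employees GROUP BY department"
print(execution_match(conn, gold, gold))
print(execution_match(conn, "SELECT COUNT(id) FROM employees", "SELECT COUNT(*) FROM employees"))
print(execution_match(conn, "SELECT COUNT(*) FROM staff", "SELECT COUNT(*) FROM employees"))
```

A check like this is more forgiving than exact match, because two differently written queries can return the same rows, so the two numbers should not be compared directly.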

One thing that surprised me: Spider's training data skews toward simple queries. I trained on the full set without compensating for that, so the model is better at easy things than it needs to be, and worse at hard things than it could be.
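
One way to compensate would be to oversample the harder examples. The sketch below is an idea I have not run on the real data. It assumes each training example carries a "difficulty" label, which is a hypothetical field: the labels would have to be computed first, for example with hardness rules like the ones in Spider's evaluation script.

```python
import random
from collections import Counter

def difficulty_weights(examples):
    """Weight each example by the inverse frequency of its difficulty bucket."""
    counts = Counter(ex["difficulty"] for ex in examples)
    return [1.0 / counts[ex["difficulty"]] for ex in examples]

def rebalanced_sample(examples, k, seed=0):
    """Draw k examples with replacement so every bucket is equally likely."""
    rng = random.Random(seed)
    return rng.choices(examples, weights=difficulty_weights(examples), k=k)

# Toy data: 8 easy questions and 2 hard ones (made up for illustration).
toy = [{"id": i, "difficulty": "easy"} for i in range(8)]
toy += [{"id": 8 + i, "difficulty": "hard"} for i in range(2)]

sample = rebalanced_sample(toy, k=1000)
print(Counter(ex["difficulty"] for ex in sample))
```

Sampling with replacement means the hard examples repeat a lot, so this trades the skew for a higher risk of overfitting on the few hard queries.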

The Downloads

The model gets pulled about 15-25 times per day without any active promotion. I think it mostly shows up in HuggingFace search results for "text-to-sql" and people grab it to test.

534 downloads is not a lot in absolute terms. But for a personal model with no organization behind it, trained in 1.5 hours on a single Colab GPU, I'll take it.

What's Next

Two more models planned: text-to-regex and text-to-shell. Same idea, small model, one focused task. I'll write about those when they're done.

The model is at AuricErgeson/Antelope-textTosql on HuggingFace. MIT license. If something breaks, drop a comment there. I check it.

Auric Ergeson Nitonde is a software development apprentice (Azubi FIAE) in Germany, building logistics tools and publishing NLP models at huggingface.co/AuricErgeson.
