Want to generate AI content but worried about API costs? For my side project TalkWith.chat, I generate hundreds of AI opinions every day — and my API bill is $0/month. The secret? Open-source models running locally.
This guide covers everything from installing Ollama to building a real content generation pipeline, start to finish.
🤔 Why Local AI?
The reason I skip cloud APIs is simple:
| | Cloud API | Local (Ollama) |
|---|---|---|
| Cost | Pay per token | $0 |
| Speed | Fast | Decent on GPU, very slow on CPU |
| Privacy | Data leaves your machine | Processed locally |
| Quality | GPT-4 level | Slightly lower, but sufficient |
| Limits | Rate limited | Unlimited |
For a side project that generates content in daily batches, the combination of "slightly lower but sufficient quality" and "$0 cost" is overwhelmingly favorable.
🚀 Installing Ollama (5 Minutes)
Ollama is the easiest way to run LLMs locally.
macOS / Linux
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
Windows
Download the installer from ollama.com.
Download a Model
```bash
# Qwen3 8B — strong multilingual support, great for content generation
ollama pull qwen3:8b

# Lighter alternative (if you're low on VRAM)
ollama pull qwen3:4b
```
Verify It Works
```bash
ollama run qwen3:8b "Tell me a fun fact about debates"
```
If you get a response, you're good to go.
💻 Hardware Requirements
Bottom line: a GPU is essentially required.
| Model | VRAM | RAM | Speed |
|---|---|---|---|
| Qwen3 4B | 3GB+ | 8GB+ | Smooth on GPU, slow on CPU |
| Qwen3 8B | 6GB+ | 16GB+ | GPU required |
| Qwen3 14B | 10GB+ | 32GB+ | High quality, slow even on GPU |
I run Qwen3 8B on an M-series Mac, and it takes about 3–5 seconds per opinion. For batch jobs, that's perfectly fine.
It technically runs on CPU too. But it's not practical. On the same 8B model, a single opinion takes 30 seconds to over a minute, and generating 500 per day would take hours. Fine for testing, but for a real pipeline, get a GPU. NVIDIA with 6GB+ VRAM, or Apple Silicon M1 or later.
How to Check If Ollama Is Using Your GPU
Always verify that Ollama is actually using your GPU. It's surprisingly common to be running on CPU without realizing it, wondering why everything is so slow.
```bash
# Method 1: Check Ollama logs (most reliable)
# macOS
grep -iE "gpu|cuda|metal" ~/.ollama/logs/server.log

# Linux
journalctl -u ollama | grep -iE "gpu|cuda"
```
If you see `gpu`, `metal` (macOS), or `cuda` (NVIDIA) in the logs, you're on GPU. If you only see `no GPU detected` or `cpu`, you're in CPU mode.
```bash
# Method 2: Monitor GPU usage in real time
# NVIDIA
nvidia-smi

# macOS (also visible in Activity Monitor)
sudo powermetrics --samplers gpu_power -i 1000
```
For NVIDIA, check whether the ollama process is occupying VRAM in nvidia-smi. If usage is 0, it's running on CPU.
```bash
# Method 3: Quick speed test
time ollama run qwen3:8b "Say hello" --verbose
```
The --verbose flag shows the token generation speed (tokens/sec). On GPU you'll see 20–40+ tokens/sec; on CPU it's around 2–5 tokens/sec. The difference is immediately obvious.
🛠️ Calling the API
Ollama serves a local HTTP API on port 11434. The examples below use its native `/api/chat` endpoint; it also exposes an OpenAI-compatible endpoint at `/v1/chat/completions`, so if you've used the OpenAI SDK before, you can point it at localhost and reuse almost the same code.
Basic Call (cURL)
```bash
curl http://localhost:11434/api/chat -d '{
  "model": "qwen3:8b",
  "messages": [
    {"role": "system", "content": "You are a thoughtful debater."},
    {"role": "user", "content": "Should remote work be the default? Give your opinion in 2-3 sentences."}
  ],
  "stream": false
}'
```
Node.js
```javascript
async function generateOpinion(persona, topic) {
  const response = await fetch('http://localhost:11434/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'qwen3:8b',
      messages: [
        { role: 'system', content: persona.systemPrompt },
        { role: 'user', content: `Topic: "${topic}". Share your opinion in 2-3 sentences.` }
      ],
      stream: false,
      options: {
        temperature: 0.8, // slightly high for diversity
        top_p: 0.9
      }
    })
  });
  const data = await response.json();
  return data.message.content;
}
```
Python
```python
import requests

def generate_opinion(persona_prompt, topic):
    response = requests.post('http://localhost:11434/api/chat', json={
        'model': 'qwen3:8b',
        'messages': [
            {'role': 'system', 'content': persona_prompt},
            {'role': 'user', 'content': f'Topic: "{topic}". Share your opinion in 2-3 sentences.'}
        ],
        'stream': False,
        'options': {
            'temperature': 0.8,
            'top_p': 0.9
        }
    })
    return response.json()['message']['content']
```
🎭 Real-World Use: Persona-Based Content Generation
If you just ask "generate an opinion," you'll get similar-sounding results every time. The key to diverse content is personas.
Designing Personas
```javascript
const personas = [
  {
    id: 'practical_parent',
    systemPrompt: `You are a practical parent in your 40s.
You value stability and safety. You tend to see issues
through the lens of how they affect families and children.
Keep opinions grounded and relatable.`
  },
  {
    id: 'tech_optimist',
    systemPrompt: `You are a 28-year-old software engineer who
believes technology can solve most problems. You're enthusiastic
but back up opinions with logic. Occasionally use tech analogies.`
  },
  {
    id: 'skeptical_student',
    systemPrompt: `You are a 20-year-old university student studying
philosophy. You question assumptions and play devil's advocate.
Your tone is curious but slightly provocative.`
  }
  // ... more personas
];
```
The same topic produces completely different opinions depending on which persona's system prompt you use.
Tips for Maximizing Diversity
- Tune the temperature: 0.7–0.9 works well. Too low and everything sounds the same; too high and you get nonsense.
- Assign a side: Add "argue FOR this topic" or "argue AGAINST this topic" to the prompt to balance pro/con coverage.
- Set a length limit: Specifying "2-3 sentences" keeps output consistent. Local models are less precise with length control than API models, so add post-processing logic to trim if needed.
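That trimming step can be a small helper. Here's a sketch (the name `trimToSentences` and the regex are my own, not part of the pipeline above):

```javascript
// Keep at most `max` sentences; a fallback for when the model ignores
// the "2-3 sentences" instruction and runs long.
function trimToSentences(text, max = 3) {
  // Split into sentences, keeping the terminating punctuation
  const sentences = text.match(/[^.!?]+[.!?]+(\s|$)/g) || [text];
  return sentences.slice(0, max).join('').trim();
}
```

If the text has no sentence-ending punctuation at all, the helper returns it unchanged rather than dropping it.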
🔄 Batch Generation Pipeline
Generating one at a time manually defeats the purpose. Automation is the whole point.
```javascript
async function generateDailyContent(topics) {
  const results = [];
  for (const topic of topics) {
    for (const persona of personas) {
      const side = Math.random() > 0.5 ? 'FOR' : 'AGAINST';
      const opinion = await generateOpinion(
        { ...persona, systemPrompt: `${persona.systemPrompt}\nArgue ${side} the topic.` },
        topic
      );
      // Quality check
      if (opinion.length < 20 || opinion.length > 500) {
        console.log(`Skipped: ${persona.id} on "${topic}" — bad length`);
        continue;
      }
      results.push({
        topic,
        persona_id: persona.id,
        side: side.toLowerCase(),
        content: opinion,
        created_at: new Date().toISOString()
      });
      // Give Ollama a breather
      await new Promise(r => setTimeout(r, 1000));
    }
  }
  return results;
}
Quality Validation Checklist
Local models occasionally produce unexpected output. At a minimum, validate these:
```javascript
function validateOpinion(text) {
  // Not too short or too long
  if (text.length < 20 || text.length > 500) return false;
  // Not echoing the system prompt
  if (text.includes('You are a')) return false;
  // Not meaningless repetition (drop empty fragments before comparing)
  const sentences = text.split('.').map(s => s.trim().toLowerCase()).filter(Boolean);
  const unique = new Set(sentences);
  if (unique.size < sentences.length * 0.5) return false;
  return true;
}
```
If validation fails, retry with the same persona + topic. After 3 failures, skip it and log.
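That retry-then-skip logic can be wrapped around any generator function. A sketch, with names of my own choosing (`generate` would be something like `() => generateOpinion(persona, topic)` and `validate` a predicate like `validateOpinion`):

```javascript
// Retry generation until it validates; give up after `maxAttempts`.
// Returns null on failure so the caller can log and skip.
async function generateWithRetry(generate, validate, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const text = await generate();
    if (validate(text)) return text;
  }
  return null; // caller logs the persona + topic and moves on
}
```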
📊 Real Results
Here's what this pipeline produces for TalkWith.chat:
- Daily topics: 5
- Personas: 100
- Daily opinions generated: ~500
- Generation time: ~40–50 minutes (Qwen3 8B, M-series Mac)
- API cost: $0
- Validation pass rate: ~92%
On a monthly basis, that's about 15,000 opinions generated for $0.
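Those numbers are easy to sanity-check, using the ~5 seconds per opinion figure from the hardware section:

```javascript
const secPerOpinion = 5;          // observed: ~3-5 s each on Qwen3 8B
const opinionsPerDay = 5 * 100;   // 5 topics x 100 personas = 500
const minutesPerDay = (opinionsPerDay * secPerOpinion) / 60;
const opinionsPerMonth = opinionsPerDay * 30;
console.log(Math.round(minutesPerDay), opinionsPerMonth); // 42 15000
```

Roughly 42 minutes of pure generation time, which lines up with the observed 40–50 minutes once validation retries are included.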
⚠️ Things to Watch Out For
Local AI isn't magic. Know these limitations going in:
- It doesn't know recent events — the model only knows up to its training data cutoff, so don't expect specific current facts
- Multilingual quality varies — English is most stable; other languages may produce lower-quality output
- Moderation is essential — inappropriate content can slip through, so always include filtering logic
- Model updates can change output — if generation quality shifts, pin a specific model version
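For the moderation point, even a keyword blocklist is better than nothing as a first pass. A minimal sketch (the names and the placeholder list are my own; a real deployment needs a proper word list and ideally a second review pass):

```javascript
// First-pass moderation gate: reject output containing blocked terms.
// The list below is a placeholder, not a complete filter.
const BLOCKED_TERMS = ['badword1', 'badword2'];

function passesModeration(text) {
  const lower = text.toLowerCase();
  return !BLOCKED_TERMS.some(term => lower.includes(term));
}
```

Running this alongside `validateOpinion` in the pipeline's quality check keeps obviously bad output from ever being stored.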
🏁 Wrapping Up
Local AI content generation shines when you need "good enough quality for the cost" rather than "perfect quality." Side projects, MVPs, prototyping, personal tools — being able to use AI freely without worrying about API bills is a real superpower.
For more on how this system works inside TalkWith.chat, check out the Dev Log series.
💬 Have you tried generating content with local AI? What model are you using and for what? I'd love to hear about it in the comments!