Want to generate AI content but worried about API costs? For my side project TalkWith.chat, I generate hundreds of AI opinions every day — and my API bill is $0/month. The secret? Open-source models running locally.
This guide covers everything from installing Ollama to building a real content generation pipeline, start to finish.
🤔 Why Local AI?
The reason I skip cloud APIs is simple:
| | Cloud API | Local (Ollama) |
|---|---|---|
| Cost | Pay per token | $0 |
| Speed | Fast | Decent on GPU, very slow on CPU |
| Privacy | Data leaves your machine | Processed locally |
| Quality | GPT-4 level | Slightly lower, but sufficient |
| Limits | Rate limited | Unlimited |
For a side project that generates content in daily batches, the combination of "slightly lower but sufficient quality" and "$0 cost" is overwhelmingly favorable.
🚀 Installing Ollama (5 Minutes)
Ollama is the easiest way to run LLMs locally.
macOS / Linux
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
Windows
Download the installer from ollama.com.
Download a Model
```bash
# Qwen3 8B — strong multilingual support, great for content generation
ollama pull qwen3:8b

# Lighter alternative (if you're low on VRAM)
ollama pull qwen3:4b
```
Verify It Works
```bash
ollama run qwen3:8b "Tell me a fun fact about debates"
```
If you get a response, you're good to go.
💻 Hardware Requirements
Bottom line: a GPU is essentially required.
| Model | VRAM | RAM | Speed |
|---|---|---|---|
| Qwen3 4B | 3GB+ | 8GB+ | Smooth on GPU, slow on CPU |
| Qwen3 8B | 6GB+ | 16GB+ | GPU required |
| Qwen3 14B | 10GB+ | 32GB+ | High quality, slow even on GPU |
I run Qwen3 8B on an M-series Mac, and it takes about 3–5 seconds per opinion. For batch jobs, that's perfectly fine.
It technically runs on CPU too. But it's not practical. On the same 8B model, a single opinion takes 30 seconds to over a minute, and generating 500 per day would take hours. Fine for testing, but for a real pipeline, get a GPU. NVIDIA with 6GB+ VRAM, or Apple Silicon M1 or later.
How to Check If Ollama Is Using Your GPU
Always verify that Ollama is actually using your GPU. It's surprisingly common to be running on CPU without realizing it, wondering why everything is so slow.
```bash
# Method 1: Check Ollama logs (most reliable)
# macOS
grep -iE "gpu|cuda|metal" ~/.ollama/logs/server.log

# Linux
journalctl -u ollama | grep -iE "gpu|cuda"
```
If you see `gpu`, `metal` (macOS), or `cuda` (NVIDIA) in the logs, you're on GPU. If you only see `no GPU detected` or `cpu`, you're in CPU mode.
```bash
# Method 2: Monitor GPU usage in real time
# NVIDIA
nvidia-smi

# macOS (also visible in Activity Monitor)
sudo powermetrics --samplers gpu_power -i 1000
```
For NVIDIA, check whether the ollama process is occupying VRAM in nvidia-smi. If usage is 0, it's running on CPU.
```bash
# Method 3: Quick speed test
time ollama run qwen3:8b "Say hello" --verbose
```
The --verbose flag shows the token generation speed (tokens/sec). On GPU you'll see 20–40+ tokens/sec; on CPU it's around 2–5 tokens/sec. The difference is immediately obvious.
🛠️ Calling the API
Ollama serves a local HTTP API on port 11434. The examples below use its native `/api/chat` endpoint; it also exposes an OpenAI-compatible endpoint at `/v1/chat/completions`, so if you've used the OpenAI SDK before, you can point it at localhost and reuse almost the same code.
Basic Call (cURL)
```bash
curl http://localhost:11434/api/chat -d '{
  "model": "qwen3:8b",
  "messages": [
    {"role": "system", "content": "You are a thoughtful debater."},
    {"role": "user", "content": "Should remote work be the default? Give your opinion in 2-3 sentences."}
  ],
  "stream": false
}'
```
Node.js
```javascript
async function generateOpinion(persona, topic) {
  const response = await fetch('http://localhost:11434/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'qwen3:8b',
      messages: [
        { role: 'system', content: persona.systemPrompt },
        { role: 'user', content: `Topic: "${topic}". Share your opinion in 2-3 sentences.` }
      ],
      stream: false,
      options: {
        temperature: 0.8, // slightly high for diversity
        top_p: 0.9
      }
    })
  });
  const data = await response.json();
  return data.message.content;
}
```
Python
```python
import requests

def generate_opinion(persona_prompt, topic):
    response = requests.post('http://localhost:11434/api/chat', json={
        'model': 'qwen3:8b',
        'messages': [
            {'role': 'system', 'content': persona_prompt},
            {'role': 'user', 'content': f'Topic: "{topic}". Share your opinion in 2-3 sentences.'}
        ],
        'stream': False,
        'options': {
            'temperature': 0.8,
            'top_p': 0.9
        }
    })
    return response.json()['message']['content']
```
🎭 Real-World Use: Persona-Based Content Generation
If you just ask "generate an opinion," you'll get similar-sounding results every time. The key to diverse content is personas.
Designing Personas
```javascript
const personas = [
  {
    id: 'practical_parent',
    systemPrompt: `You are a practical parent in your 40s.
You value stability and safety. You tend to see issues
through the lens of how they affect families and children.
Keep opinions grounded and relatable.`
  },
  {
    id: 'tech_optimist',
    systemPrompt: `You are a 28-year-old software engineer who
believes technology can solve most problems. You're enthusiastic
but back up opinions with logic. Occasionally use tech analogies.`
  },
  {
    id: 'skeptical_student',
    systemPrompt: `You are a 20-year-old university student studying
philosophy. You question assumptions and play devil's advocate.
Your tone is curious but slightly provocative.`
  }
  // ... more personas
];
```
The same topic produces completely different opinions depending on which persona's system prompt you use.
Tips for Maximizing Diversity
- Tune the temperature: 0.7–0.9 works well. Too low and everything sounds the same; too high and you get nonsense.
- Assign a side: Add "argue FOR this topic" or "argue AGAINST this topic" to the prompt to balance pro/con coverage.
- Set a length limit: Specifying "2-3 sentences" keeps output consistent. Local models are less precise with length control than API models, so add post-processing logic to trim if needed.
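That trimming step can be a small helper. Here's a sketch (the name `trimToSentences` and the regex are my own, not part of the pipeline above):

```javascript
// Keep at most `max` sentences; a fallback for when the model ignores
// the "2-3 sentences" instruction and runs long.
function trimToSentences(text, max = 3) {
  // Split into sentences, keeping the terminating punctuation
  const sentences = text.match(/[^.!?]+[.!?]+(\s|$)/g) || [text];
  return sentences.slice(0, max).join('').trim();
}
```

If the text has no sentence-ending punctuation at all, the helper returns it unchanged rather than dropping it.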
🔄 Batch Generation Pipeline
Generating one at a time manually defeats the purpose. Automation is the whole point.
```javascript
async function generateDailyContent(topics) {
  const results = [];
  for (const topic of topics) {
    for (const persona of personas) {
      const side = Math.random() > 0.5 ? 'FOR' : 'AGAINST';
      const opinion = await generateOpinion(
        { ...persona, systemPrompt: `${persona.systemPrompt}\nArgue ${side} the topic.` },
        topic
      );
      // Quality check
      if (opinion.length < 20 || opinion.length > 500) {
        console.log(`Skipped: ${persona.id} on "${topic}" — bad length`);
        continue;
      }
      results.push({
        topic,
        persona_id: persona.id,
        side: side.toLowerCase(),
        content: opinion,
        created_at: new Date().toISOString()
      });
      // Give Ollama a breather
      await new Promise(r => setTimeout(r, 1000));
    }
  }
  return results;
}
Quality Validation Checklist
Local models occasionally produce unexpected output. At a minimum, validate these:
```javascript
function validateOpinion(text) {
  // Not too short or too long
  if (text.length < 20 || text.length > 500) return false;
  // Not echoing the system prompt
  if (text.includes('You are a')) return false;
  // Not meaningless repetition (drop empty fragments before comparing)
  const sentences = text.split('.').map(s => s.trim().toLowerCase()).filter(Boolean);
  const unique = new Set(sentences);
  if (unique.size < sentences.length * 0.5) return false;
  return true;
}
```
If validation fails, retry with the same persona + topic. After 3 failures, skip it and log.
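That retry-then-skip logic can be wrapped around any generator function. A sketch, with names of my own choosing (`generate` would be something like `() => generateOpinion(persona, topic)` and `validate` a predicate like `validateOpinion`):

```javascript
// Retry generation until it validates; give up after `maxAttempts`.
// Returns null on failure so the caller can log and skip.
async function generateWithRetry(generate, validate, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const text = await generate();
    if (validate(text)) return text;
  }
  return null; // caller logs the persona + topic and moves on
}
```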
📊 Real Results
Here's what this pipeline produces for TalkWith.chat:
- Daily topics: 5
- Personas: 100
- Daily opinions generated: ~500
- Generation time: ~40–50 minutes (Qwen3 8B, M-series Mac)
- API cost: $0
- Validation pass rate: ~92%
On a monthly basis, that's about 15,000 opinions generated for $0.
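Those numbers are easy to sanity-check, using the ~5 seconds per opinion figure from the hardware section:

```javascript
const secPerOpinion = 5;          // observed: ~3-5 s each on Qwen3 8B
const opinionsPerDay = 5 * 100;   // 5 topics x 100 personas = 500
const minutesPerDay = (opinionsPerDay * secPerOpinion) / 60;
const opinionsPerMonth = opinionsPerDay * 30;
console.log(Math.round(minutesPerDay), opinionsPerMonth); // 42 15000
```

Roughly 42 minutes of pure generation time, which lines up with the observed 40–50 minutes once validation retries are included.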
⚠️ Things to Watch Out For
Local AI isn't magic. Know these limitations going in:
- It doesn't know recent events — the model only knows up to its training data cutoff, so don't expect specific current facts
- Multilingual quality varies — English is most stable; other languages may produce lower-quality output
- Moderation is essential — inappropriate content can slip through, so always include filtering logic
- Model updates can change output — if generation quality shifts, pin a specific model version
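For the moderation point, even a keyword blocklist is better than nothing as a first pass. A minimal sketch (the names and the placeholder list are my own; a real deployment needs a proper word list and ideally a second review pass):

```javascript
// First-pass moderation gate: reject output containing blocked terms.
// The list below is a placeholder, not a complete filter.
const BLOCKED_TERMS = ['badword1', 'badword2'];

function passesModeration(text) {
  const lower = text.toLowerCase();
  return !BLOCKED_TERMS.some(term => lower.includes(term));
}
```

Running this alongside `validateOpinion` in the pipeline's quality check keeps obviously bad output from ever being stored.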
🏁 Wrapping Up
Local AI content generation shines when you need "good enough quality for the cost" rather than "perfect quality." Side projects, MVPs, prototyping, personal tools — being able to use AI freely without worrying about API bills is a real superpower.
For more on how this system works inside TalkWith.chat, check out the Dev Log series.
💬 Have you tried generating content with local AI? What model are you using and for what? I'd love to hear about it in the comments!