
brian austin

I built a $2/month AI assistant and hosted it myself — here's the full architecture


I got tired of the token-counting anxiety. Every time I used the Claude API directly, I was watching the meter tick: 1,000 tokens here, 5,000 tokens there. A long debugging session could cost $3-4 in a single sitting.

So I built a flat-rate wrapper. Same Claude model underneath. Fixed $2/month. No per-token billing.

Here's how it actually works.

The architecture

```
User browser
    ↓ HTTPS
Node.js server (VPS, 2GB RAM)
    ↓ Auth middleware (JWT)
Session manager (rate limiting per user)
    ↓ Anthropic SDK
Claude API
    ↑ Response
Streaming back to browser
```

That's it. The magic isn't in the architecture — it's in the business model.

The key technical pieces

1. Rate limiting per user

The most important component. Without this, one heavy user can burn through your entire API budget in a day.

```javascript
// Simple in-memory rate limiter
// For production: use Redis
const userLimits = new Map();

function checkRateLimit(userId) {
  const now = Date.now();
  const windowMs = 60 * 60 * 1000; // 1 hour window
  const maxRequests = 50; // per hour

  if (!userLimits.has(userId)) {
    userLimits.set(userId, { count: 0, resetAt: now + windowMs });
  }

  const limit = userLimits.get(userId);

  if (now > limit.resetAt) {
    limit.count = 0;
    limit.resetAt = now + windowMs;
  }

  if (limit.count >= maxRequests) {
    return { allowed: false, resetAt: limit.resetAt };
  }

  limit.count++;
  return { allowed: true };
}
```
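A quick sanity check of the limiter's behaviour: one user making 51 requests inside a single window should get 50 through and have the 51st refused. The limiter is reproduced here so the snippet runs standalone:

```javascript
// Same limiter as above, reproduced so this example is self-contained.
const userLimits = new Map();

function checkRateLimit(userId) {
  const now = Date.now();
  const windowMs = 60 * 60 * 1000;
  const maxRequests = 50;

  if (!userLimits.has(userId)) {
    userLimits.set(userId, { count: 0, resetAt: now + windowMs });
  }
  const limit = userLimits.get(userId);
  if (now > limit.resetAt) {
    limit.count = 0;
    limit.resetAt = now + windowMs;
  }
  if (limit.count >= maxRequests) {
    return { allowed: false, resetAt: limit.resetAt };
  }
  limit.count++;
  return { allowed: true };
}

// Simulate one user hammering the endpoint 51 times in one window.
const results = [];
for (let i = 0; i < 51; i++) {
  results.push(checkRateLimit('user-123').allowed);
}
console.log(results.filter(Boolean).length); // 50 allowed
console.log(results[50]); // false, request 51 is refused
```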

2. Streaming responses

Users expect real-time output. Nobody wants to wait 10 seconds for a full response to appear at once.

```javascript
app.post('/api/chat', authenticate, async (req, res) => {
  const { message, history } = req.body;

  // Enforce the per-user limit before touching the API
  // (assumes the JWT payload carries a userId field)
  const limit = checkRateLimit(req.user.userId);
  if (!limit.allowed) {
    return res.status(429).json({ error: 'Rate limited', resetAt: limit.resetAt });
  }

  // Set streaming headers
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  try {
    const stream = await anthropic.messages.stream({
      model: 'claude-3-5-sonnet-20241022',
      max_tokens: 2048,
      messages: trimHistory(history).concat([{ role: 'user', content: message }])
    });

    for await (const chunk of stream) {
      if (chunk.type === 'content_block_delta') {
        res.write(`data: ${JSON.stringify({ text: chunk.delta.text })}\n\n`);
      }
    }
  } catch (err) {
    // Headers are already sent, so report the failure in-band
    res.write(`data: ${JSON.stringify({ error: 'Upstream error' })}\n\n`);
  }

  res.write('data: [DONE]\n\n');
  res.end();
});
```
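The original post doesn't show the client side, but the browser has to undo this framing: network reads can split an SSE frame in half, so the client needs a small buffering parser rather than a naive line split. A sketch:

```javascript
// Parse Server-Sent-Events frames as they arrive. A network read can end
// mid-frame, so buffer input and only consume complete "\n\n"-terminated
// frames. Handles the [DONE] sentinel the server emits.
function makeSSEParser(onText, onDone) {
  let buffer = '';
  return function feed(chunk) {
    buffer += chunk;
    let idx;
    while ((idx = buffer.indexOf('\n\n')) !== -1) {
      const frame = buffer.slice(0, idx);
      buffer = buffer.slice(idx + 2);
      if (!frame.startsWith('data: ')) continue;
      const payload = frame.slice(6);
      if (payload === '[DONE]') { onDone(); continue; }
      onText(JSON.parse(payload).text);
    }
  };
}

// Browser usage (sketch):
// const res = await fetch('/api/chat', { method: 'POST', headers, body });
// const reader = res.body.getReader();
// const decoder = new TextDecoder();
// const feed = makeSSEParser(t => appendToChat(t), () => markComplete());
// while (true) {
//   const { done, value } = await reader.read();
//   if (done) break;
//   feed(decoder.decode(value, { stream: true }));
// }
```

`fetch` is used instead of `EventSource` because `EventSource` can't send a POST body.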

3. Conversation history management

Claude doesn't have memory between API calls. You need to send the full conversation history each time — but you need to trim it or costs explode.

```javascript
function trimHistory(history, maxTokenEstimate = 4000) {
  // Rough estimate: 1 token ≈ 4 characters
  const charLimit = maxTokenEstimate * 4;

  let totalChars = 0;
  const trimmed = [];

  // Walk backwards, keep recent messages that fit
  for (let i = history.length - 1; i >= 0; i--) {
    const msgChars = history[i].content.length;
    if (totalChars + msgChars > charLimit) break;
    trimmed.unshift(history[i]);
    totalChars += msgChars;
  }

  return trimmed;
}
```
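To see the trimming in action, here's the function again with a deliberately tiny budget so the oldest message gets dropped:

```javascript
// trimHistory from above, reproduced so this example runs standalone.
function trimHistory(history, maxTokenEstimate = 4000) {
  const charLimit = maxTokenEstimate * 4; // rough: 1 token ≈ 4 characters
  let totalChars = 0;
  const trimmed = [];
  for (let i = history.length - 1; i >= 0; i--) {
    const msgChars = history[i].content.length;
    if (totalChars + msgChars > charLimit) break;
    trimmed.unshift(history[i]);
    totalChars += msgChars;
  }
  return trimmed;
}

const history = [
  { role: 'user', content: 'a'.repeat(8000) },      // oldest, gets dropped
  { role: 'assistant', content: 'b'.repeat(4000) },
  { role: 'user', content: 'c'.repeat(4000) },      // most recent, kept
];

const kept = trimHistory(history, 2000); // 2000-token budget ≈ 8000 chars
console.log(kept.length); // 2
```

One caveat worth knowing: after trimming, make sure the first remaining message is a `user` turn, since the Messages API expects conversations to start with one.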

4. Auth with JWT

```javascript
const jwt = require('jsonwebtoken');

function authenticate(req, res, next) {
  const token = req.headers.authorization?.replace('Bearer ', '');

  if (!token) return res.status(401).json({ error: 'No token' });

  try {
    req.user = jwt.verify(token, process.env.JWT_SECRET);
    next();
  } catch {
    res.status(401).json({ error: 'Invalid token' });
  }
}
```

The infrastructure cost breakdown

| Component | Monthly cost |
|---|---|
| VPS (2GB RAM, Hetzner) | €4.51 (~$5) |
| Anthropic API budget | $40-60 |
| Domain + SSL | ~$1 amortized |
| **Total per month** | **~$65** |
| Revenue at 50 users | $100 |
| Profit margin | 35% |

The model works because most subscribers are occasional users. At $2/month, your customers aren't power users running Claude 8 hours a day; they're developers who want AI on demand without commitment.

What I'd do differently

Redis for rate limiting instead of in-memory. When the server restarts, in-memory limits reset. Redis survives restarts and scales across multiple instances.

Per-user token budgets tracked in a database. Right now rate limiting is request-based (50 requests/hour). Better: track actual tokens per user per month and enforce a ceiling.
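A sketch of what that could look like, with a `Map` standing in for the database and an assumed 500k-token monthly ceiling (the budget number and the 4-chars-per-token estimate are placeholders, not tuned values):

```javascript
// Track estimated tokens per user per calendar month.
// A Map stands in for a real DB table here.
const monthlyUsage = new Map();
const MONTHLY_TOKEN_BUDGET = 500_000; // assumed ceiling per $2 subscriber

function usageKey(userId, date = new Date()) {
  return `${userId}:${date.getUTCFullYear()}-${date.getUTCMonth() + 1}`;
}

function recordUsage(userId, inputChars, outputChars) {
  const tokens = Math.ceil((inputChars + outputChars) / 4); // ~4 chars/token
  const key = usageKey(userId);
  monthlyUsage.set(key, (monthlyUsage.get(key) || 0) + tokens);
}

function withinBudget(userId) {
  return (monthlyUsage.get(usageKey(userId)) || 0) < MONTHLY_TOKEN_BUDGET;
}

recordUsage('heavy-user', 1_000_000, 1_000_000); // ~500k tokens this month
console.log(withinBudget('heavy-user')); // false, ceiling reached

recordUsage('light-user', 4000, 4000); // ~2k tokens
console.log(withinBudget('light-user')); // true
```

Keying by month means budgets reset automatically without a cron job.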

Model routing — use Claude Haiku for short factual queries, Sonnet for longer reasoning tasks. Haiku costs ~10x less per token. Automatic model selection based on prompt length and complexity could cut API costs by 40%.
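A first cut at that routing heuristic might look like this (the thresholds and the "complexity" regex are guesses to be tuned against real traffic, not measured values):

```javascript
// Route short factual prompts to Haiku, longer or code-heavy ones to Sonnet.
// Thresholds are starting points, not tuned values.
function pickModel(message, history = []) {
  const contextChars = history.reduce((n, m) => n + m.content.length, 0);
  const looksComplex =
    message.length > 500 ||
    contextChars > 4000 ||
    /```|explain|debug|refactor|why/i.test(message);
  return looksComplex
    ? 'claude-3-5-sonnet-20241022'
    : 'claude-3-5-haiku-20241022';
}

console.log(pickModel('What year was Node.js released?')); // haiku
console.log(pickModel('Debug this stack trace for me: ...')); // sonnet
```

Log which model each request got and spot-check the answers before trusting a heuristic like this with paying users.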

Is it worth building vs buying?

If you want to run this yourself: the Anthropic API, a $5/month Hetzner VPS, and about 200 lines of Node.js get you there. The code above covers ~80% of what you need.

If you just want access without the infra headache: SimplyLouie is what I run for others. $2/month, same Claude model, no server to maintain. Free 7-day trial, card required but not charged until day 8.

What's your setup?

Are you running your own Claude wrapper? Using the raw API with token budgets? Or just paying full price for ChatGPT Plus?

I'm curious what cost-control strategies developers are actually using in production — drop them in the comments.

#claude #ai #webdev #tutorial #discuss
