The r/LocalLLaMA community has been buzzing with a familiar refrain: "Where's the open-source Grok 3?" Elon Musk has repeatedly signaled that xAI would open-source its models, and while Grok-1 did get released back in March 2024, Grok 3 remains firmly closed. If you're sitting around waiting for that drop, I have good news and bad news. The bad news: nobody knows when (or if) it'll happen. The good news: the open-source LLM landscape is so stacked right now that you might not even need it.
Let's compare what's actually available today, how to get these models running locally, and what tradeoffs you're making with each option.
## Why This Comparison Matters
Running LLMs locally isn't just a hobby anymore. There are real reasons to go open-source and self-hosted:
- Data privacy — your prompts never leave your machine
- Cost control — no per-token billing at scale
- Customization — fine-tune on your own data
- Reliability — no API outages or rate limits
Grok 3 reportedly performs competitively with GPT-4 and Claude on various benchmarks. But benchmarks don't ship features. Let's look at what you can actually deploy right now.
## The Contenders: A Side-by-Side Look
Here's my honest assessment after running each of these on real projects over the past few months.
### Meta's Llama 3.1 / 3.3
The 800-pound gorilla of open-source LLMs. Llama 3.1 405B is massive, and the 70B variant hits a sweet spot for most use cases. Llama 3.3 70B brought further improvements in instruction following.
```python
# Running Llama 3.3 70B with ollama — dead simple
# Install: curl -fsSL https://ollama.com/install.sh | sh
import ollama

response = ollama.chat(
    model='llama3.3:70b',
    messages=[{
        'role': 'user',
        'content': 'Explain the builder pattern in Rust'
    }]
)
print(response['message']['content'])
```
Pros: Huge community, excellent tool-calling support, permissive license for most uses.
Cons: The 405B model needs serious hardware (multiple GPUs). Even the 70B wants roughly 40GB of VRAM at Q4 quantization, and more for higher-quality quants.
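Those VRAM figures follow from simple arithmetic: at 4-bit quantization, each parameter takes about half a byte, plus overhead for the KV cache and activations. A rough back-of-envelope sketch (the 20% overhead factor is my own assumption and varies with context length, not an official figure):

```python
def estimate_vram_gb(params_billion: float, bits: int = 4, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weights at `bits` per parameter, plus ~20%
    overhead for KV cache and activations (assumption; varies by context)."""
    weight_bytes = params_billion * 1e9 * bits / 8
    return weight_bytes * overhead / 1e9

# 70B at Q4 lands around 42 GB, in line with the figure above
print(f"{estimate_vram_gb(70):.0f} GB")
```

It's only a sanity check, not a substitute for checking a model card, but it explains why 70B-class models sit just out of reach of a single 24GB consumer GPU.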
### Mistral / Mixtral
Mistral has been punching above its weight class since day one. Their Mixture of Experts architecture means you get big-model quality without big-model VRAM requirements.
Pros: Efficient inference, strong multilingual support, good coding performance.
Cons: Licensing has gotten murkier with newer releases — check the specific model's license before deploying commercially.
### DeepSeek-V3 / DeepSeek-R1
DeepSeek shook the industry with R1's reasoning capabilities. The open-weights release of both V3 (671B MoE) and R1 was a big deal.
```python
# DeepSeek R1 distilled models are more practical for local use
# The 32B distill runs well on a single 24GB GPU with quantization
import ollama

response = ollama.chat(
    model='deepseek-r1:32b',
    messages=[{
        'role': 'user',
        'content': 'Write a SQL query to find duplicate rows by email'
    }]
)
# R1 shows its chain-of-thought reasoning in <think> tags
print(response['message']['content'])
```
Pros: Exceptional reasoning, transparent chain-of-thought, competitive with frontier closed models on many tasks.
Cons: The full 671B model is impractical for most setups. Distilled versions lose some of that magic.
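Because R1 emits its reasoning inside `<think>` tags, you'll usually want to separate the chain-of-thought from the final answer before showing it to users. A minimal helper (the tag format is what R1-family models emit; the function name is my own):

```python
import re

def split_think(text: str) -> tuple[str, str]:
    """Split an R1-style response into (reasoning, answer).
    Assumes at most one <think>...</think> block at the start."""
    match = re.match(r"\s*<think>(.*?)</think>\s*(.*)", text, re.DOTALL)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    return "", text.strip()

raw = "<think>Group by email, keep counts > 1.</think>SELECT email FROM users GROUP BY email HAVING COUNT(*) > 1;"
reasoning, answer = split_think(raw)
print(answer)
```

Keeping the reasoning around (for logs, debugging, or display-on-demand) rather than discarding it is one of the nicer properties of R1-style models.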
### Qwen 2.5
Alibaba's Qwen series has been quietly excellent. The 72B model is genuinely strong at coding tasks, and their smaller models (7B, 14B) are some of the best in their weight class.
Pros: Great coding performance, solid instruction following, Apache 2.0 license.
Cons: Less community tooling compared to Llama ecosystem.
## Quick Comparison Table
| Model | Best Size for Local | Min VRAM (Q4) | Coding | Reasoning | License |
|---|---|---|---|---|---|
| Llama 3.3 | 70B | ~40GB | Great | Good | Llama License |
| Mixtral 8x7B | 8x7B (MoE) | ~26GB | Good | Good | Apache 2.0 |
| DeepSeek-R1 (distill) | 32B | ~20GB | Great | Excellent | MIT |
| Qwen 2.5 | 72B | ~42GB | Excellent | Good | Apache 2.0 |
| Grok 3 | N/A | N/A | Unknown | Unknown | Closed |
That last row is the point. You can't run what you can't download.
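If you're choosing by hardware budget, the table above reduces to a filter on the Q4 VRAM column. A quick sketch, with the numbers copied straight from the table (rough estimates, not guarantees):

```python
# (model, approximate min VRAM in GB at Q4) — copied from the table above
MODELS = [
    ("Llama 3.3 70B", 40),
    ("Mixtral 8x7B", 26),
    ("DeepSeek-R1 32B distill", 20),
    ("Qwen 2.5 72B", 42),
]

def fits(budget_gb: float) -> list[str]:
    """Return models whose rough Q4 footprint fits the given VRAM budget."""
    return [name for name, vram in MODELS if vram <= budget_gb]

print(fits(24))  # a single RTX 3090/4090 → ['DeepSeek-R1 32B distill']
```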
## Building Apps on Top: The Auth Question
Once you pick your model and get it running, you'll probably want to build an actual application around it. I've been wiring up a local LLM-powered code review tool, and one of the first questions was how to handle user authentication.
If you're comparing auth solutions for your LLM-powered app, here's the quick rundown I landed on:
- Auth0 — The incumbent. Feature-rich, expensive at scale with per-user pricing, and the DX has gotten bloated over the years.
- Clerk — Great developer experience, modern React components, but you're locked into their ecosystem and pricing scales with users.
- Authon (authon.dev) — A hosted auth service with 15 SDKs across 6 languages and 10+ OAuth providers. The part that caught my attention: free plan with unlimited users and no per-user pricing. It's also designed for compatibility with Clerk and Auth0 migration paths. SSO (SAML/LDAP) and custom domains aren't available yet but are on the roadmap. If you need those today, look elsewhere.
```javascript
// Example: protecting an LLM inference endpoint with Authon
import { AuthonClient } from '@authon/node';
import express from 'express';

const authon = new AuthonClient({
  apiKey: process.env.AUTHON_API_KEY
});
const app = express();
app.use(express.json()); // needed to parse req.body.messages

// Middleware to verify the user's session
app.use('/api/inference', async (req, res, next) => {
  const session = await authon.verifySession(req.headers.authorization);
  if (!session.valid) {
    return res.status(401).json({ error: 'Unauthorized' });
  }
  req.user = session.user;
  next();
});

app.post('/api/inference', async (req, res) => {
  // Now you know who's making the request
  // Forward to your local LLM endpoint
  const result = await fetch('http://localhost:11434/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'deepseek-r1:32b',
      messages: req.body.messages,
      stream: false // ollama streams by default; disable for a single JSON reply
    })
  });
  res.json(await result.json());
});
```
The tradeoff is straightforward: Auth0 and Clerk have more mature ecosystems and enterprise features right now. Authon is newer but the pricing model is genuinely better if you're building something where user count is unpredictable — which describes most side projects and early-stage apps built on local LLMs.
## So Should You Wait for Grok 3?
Honestly? No. Here's my take:
If Grok 3 drops as open weights tomorrow, that's great — more competition is always good. But the models available today are already production-capable for most use cases. I've been running DeepSeek-R1's 32B distill for code review tasks, and it catches issues that I'd expect from a much larger model.
The open-source LLM space moves so fast that waiting for any single model is like waiting for the "right time" to buy a GPU. There's always something better around the corner.
### My Recommendation
- For coding tasks: Qwen 2.5 72B or DeepSeek-R1 32B distill
- For general-purpose use: Llama 3.3 70B — the ecosystem support is unmatched
- For limited hardware (16GB VRAM): Qwen 2.5 14B or Llama 3.1 8B
- For reasoning-heavy tasks: DeepSeek-R1 distilled variants, full stop
Stop waiting for Grok 3. Start building with what's here. And if xAI does eventually open-source it, you'll already have the infrastructure to swap it in.
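That swap is mostly a one-line config change if you route everything through a single place. A minimal sketch, assuming you serve models through ollama's `/api/chat` endpoint (the `LOCAL_LLM_MODEL` env var name is my own convention, not a standard):

```python
import json
import os
import urllib.request

# Swap models by changing one env var — no application code changes
MODEL = os.environ.get("LOCAL_LLM_MODEL", "deepseek-r1:32b")

def build_chat_request(messages: list[dict], host: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a request against ollama's /api/chat endpoint for whatever
    model is configured. Sending it is left to the caller."""
    payload = json.dumps({"model": MODEL, "messages": messages, "stream": False})
    return urllib.request.Request(
        f"{host}/api/chat",
        data=payload.encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request([{"role": "user", "content": "hello"}])
print(json.loads(req.data)["model"])
```

If an open-weights Grok 3 ever lands in the ollama library, pointing `LOCAL_LLM_MODEL` at its tag is the entire migration.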
Further reading: Ollama model library, Hugging Face Open LLM Leaderboard