The first time I used Groq, I thought the response had been cached: Llama 3 70B generating 500+ tokens per second, roughly 10x faster than GPT-4 on OpenAI. It wasn't cached. It's just that fast.
What Groq Offers for Free
Groq free tier:
- 14,400 requests/day for smaller models
- Llama 3 70B, 8B — free with rate limits
- Mixtral 8x7B — free with rate limits
- Gemma 7B — free
- OpenAI-compatible API — drop-in replacement
- 500+ tokens/second — fastest LLM inference available
- Custom LPU hardware (not GPU)
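Those free-tier rate limits mean you will eventually see HTTP 429 responses. A minimal retry-with-backoff sketch (the helper names `backoffDelayMs` and `withRetry` are my own, not part of the Groq SDK; it assumes the thrown error carries a `status` field, as the SDK's API errors do):

```javascript
// Exponential backoff with a cap, for retrying rate-limited requests.
// attempt 0 → 500 ms, 1 → 1000 ms, 2 → 2000 ms, ... capped at capMs.
function backoffDelayMs(attempt, baseMs = 500, capMs = 8000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Run fn(), retrying only on 429 (rate limit); rethrow anything else.
async function withRetry(fn, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (err.status !== 429 || attempt === maxAttempts - 1) throw err;
      await new Promise((r) => setTimeout(r, backoffDelayMs(attempt)));
    }
  }
}
```

Wrap any of the SDK calls below in `withRetry(() => groq.chat.completions.create({...}))` and free-tier bursts become much less painful.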
Quick Start
```bash
npm install groq-sdk
```

```javascript
import Groq from 'groq-sdk';

const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });

const response = await groq.chat.completions.create({
  model: 'llama3-70b-8192',
  messages: [{ role: 'user', content: 'Explain WebSockets in one paragraph' }],
  temperature: 0.7,
  max_tokens: 200
});

console.log(response.choices[0].message.content);
// A 200-token response arrives in roughly 0.3 seconds
```
REST API
```bash
# Chat completion
curl 'https://api.groq.com/openai/v1/chat/completions' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3-70b-8192",
    "messages": [{"role": "user", "content": "Write a Python quicksort"}],
    "temperature": 0
  }'

# Streaming
curl 'https://api.groq.com/openai/v1/chat/completions' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3-8b-8192",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'

# List models
curl 'https://api.groq.com/openai/v1/models' \
  -H 'Authorization: Bearer YOUR_API_KEY'
```
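If you consume the streaming endpoint over raw HTTP instead of an SDK, the body is OpenAI-style server-sent events: `data: {json}` lines, ending with a `data: [DONE]` sentinel. A small parser sketch for one such line (`parseSSELine` is my own name, not a library function):

```javascript
// Parse one SSE line from an OpenAI-compatible streaming response.
// Returns the token text it carries, or null for blanks, [DONE], and
// chunks whose delta has no content (e.g. the final role/stop chunk).
function parseSSELine(line) {
  if (!line.startsWith('data: ')) return null;
  const payload = line.slice('data: '.length);
  if (payload === '[DONE]') return null; // end-of-stream sentinel
  const chunk = JSON.parse(payload);
  return chunk.choices?.[0]?.delta?.content ?? null;
}
```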
Use with OpenAI SDK
```javascript
import OpenAI from 'openai';

// Just change baseURL and apiKey — everything else stays the same
const client = new OpenAI({
  baseURL: 'https://api.groq.com/openai/v1',
  apiKey: process.env.GROQ_API_KEY
});

const chat = await client.chat.completions.create({
  model: 'llama3-70b-8192',
  messages: [{ role: 'user', content: 'Hello Groq!' }]
});
```
Streaming
```javascript
const stream = await groq.chat.completions.create({
  model: 'llama3-70b-8192',
  messages: [{ role: 'user', content: 'Write a short story about AI' }],
  stream: true
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) process.stdout.write(content);
}
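In an app you usually want both the incremental tokens (for the UI) and the final string (for storage). A small collector sketch, assuming the OpenAI-compatible delta chunk shape shown above (`collectStream` is my own helper, not an SDK function):

```javascript
// Accumulate a streamed completion into one string while forwarding
// each token to a callback, e.g. to append to a chat UI as it arrives.
async function collectStream(stream, onToken = () => {}) {
  let full = '';
  for await (const chunk of stream) {
    const token = chunk.choices?.[0]?.delta?.content;
    if (token) {
      full += token;
      onToken(token);
    }
  }
  return full;
}
```

It works with the SDK stream directly (`const text = await collectStream(stream)`) or with any async iterable of chunks.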
JSON Mode
```javascript
const result = await groq.chat.completions.create({
  model: 'llama3-70b-8192',
  messages: [{
    role: 'user',
    content: 'Extract entities from: "Apple CEO Tim Cook announced iPhone 16 in Cupertino". Return JSON with: company, person, product, location.'
  }],
  response_format: { type: 'json_object' },
  temperature: 0
});

const entities = JSON.parse(result.choices[0].message.content);
```
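JSON mode guarantees valid JSON, not your schema, so it's worth checking the keys before trusting the result. A minimal guard matching the fields requested in the prompt above (`validateEntities` is my own helper):

```javascript
// Check that a model-extracted entities object has every expected field
// as a non-empty string; report which fields are missing or malformed.
function validateEntities(obj) {
  const required = ['company', 'person', 'product', 'location'];
  const missing = required.filter(
    (k) => typeof obj[k] !== 'string' || obj[k] === ''
  );
  return { ok: missing.length === 0, missing };
}
```

On failure you can retry the request or fall back to a stricter prompt instead of crashing on `undefined` downstream.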
Speed Benchmark
Model: Llama 3 70B (OpenAI row uses GPT-4o for comparison)

| Provider | Tokens/sec | Time to first token |
|----------|-----------:|--------------------:|
| Groq     | 520        | 0.2 s |
| Together | 80         | 0.8 s |
| OpenAI (GPT-4o) | 50  | 1.2 s |
| Anyscale | 40         | 1.5 s |
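What throughput means for perceived latency follows from a simple decomposition: total wall time is time-to-first-token plus token count divided by throughput. A back-of-the-envelope helper using the figures above (the formula is the obvious model, not a published benchmark methodology):

```javascript
// Estimated wall-clock time for a completion:
// time to first token + generation time at the given throughput.
function wallTimeSec(tokens, tokensPerSec, ttftSec) {
  return ttftSec + tokens / tokensPerSec;
}

// 500-token completion: Groq ≈ 1.16 s vs OpenAI (GPT-4o) ≈ 11.2 s
const groqTime = wallTimeSec(500, 520, 0.2);
const openaiTime = wallTimeSec(500, 50, 1.2);
```

By this estimate, a 500-token answer finishes in just over a second on Groq versus more than ten seconds at 50 tokens/sec, which is the difference the intro describes.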
Use Cases
- Real-time chatbots — instant responses
- Code generation — rapid iteration
- Data extraction — parse thousands of documents fast
- Content generation — draft articles in seconds
- Agent frameworks — fast tool-calling loops
Need fast AI-powered scraping? Check out my web scraping actors on Apify — intelligent data extraction.
Need ultra-fast AI integration? Email me at spinov001@gmail.com.