The first time I used Groq, I thought the response had been cached: Llama 3 70B generating 500+ tokens per second, roughly 10x faster than GPT-4 on OpenAI. It wasn't cached. It's just that fast.
What Groq Offers for Free
Groq free tier:
- 14,400 requests/day for smaller models
- Llama 3 70B, 8B — free with rate limits
- Mixtral 8x7B — free with rate limits
- Gemma 7B — free
- OpenAI-compatible API — drop-in replacement
- 500+ tokens/second — fastest LLM inference available
- Custom LPU hardware (not GPU)
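Those free-tier rate limits mean you will eventually see HTTP 429 responses. A minimal retry-with-backoff sketch (the helper names `backoffDelayMs` and `withRetry` are my own, not part of the Groq SDK; it assumes the thrown error carries a `status` field, as the SDK's API errors do):

```javascript
// Exponential backoff with a cap, for retrying rate-limited requests.
// attempt 0 → 500 ms, 1 → 1000 ms, 2 → 2000 ms, ... capped at capMs.
function backoffDelayMs(attempt, baseMs = 500, capMs = 8000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Run fn(), retrying only on 429 (rate limit); rethrow anything else.
async function withRetry(fn, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (err.status !== 429 || attempt === maxAttempts - 1) throw err;
      await new Promise((r) => setTimeout(r, backoffDelayMs(attempt)));
    }
  }
}
```

Wrap any of the SDK calls below in `withRetry(() => groq.chat.completions.create({...}))` and free-tier bursts become much less painful.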
Quick Start
```bash
npm install groq-sdk
```

```javascript
import Groq from 'groq-sdk';

const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });

const response = await groq.chat.completions.create({
  model: 'llama3-70b-8192',
  messages: [{ role: 'user', content: 'Explain WebSockets in one paragraph' }],
  temperature: 0.7,
  max_tokens: 200
});

console.log(response.choices[0].message.content);
// A 200-token response arrives in roughly 0.3 seconds
```
REST API
```bash
# Chat completion
curl 'https://api.groq.com/openai/v1/chat/completions' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3-70b-8192",
    "messages": [{"role": "user", "content": "Write a Python quicksort"}],
    "temperature": 0
  }'

# Streaming
curl 'https://api.groq.com/openai/v1/chat/completions' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3-8b-8192",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'

# List models
curl 'https://api.groq.com/openai/v1/models' \
  -H 'Authorization: Bearer YOUR_API_KEY'
```
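If you consume the streaming endpoint over raw HTTP instead of an SDK, the body is OpenAI-style server-sent events: `data: {json}` lines, ending with a `data: [DONE]` sentinel. A small parser sketch for one such line (`parseSSELine` is my own name, not a library function):

```javascript
// Parse one SSE line from an OpenAI-compatible streaming response.
// Returns the token text it carries, or null for blanks, [DONE], and
// chunks whose delta has no content (e.g. the final role/stop chunk).
function parseSSELine(line) {
  if (!line.startsWith('data: ')) return null;
  const payload = line.slice('data: '.length);
  if (payload === '[DONE]') return null; // end-of-stream sentinel
  const chunk = JSON.parse(payload);
  return chunk.choices?.[0]?.delta?.content ?? null;
}
```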
Use with OpenAI SDK
```javascript
import OpenAI from 'openai';

// Just change baseURL and apiKey — everything else stays the same
const client = new OpenAI({
  baseURL: 'https://api.groq.com/openai/v1',
  apiKey: process.env.GROQ_API_KEY
});

const chat = await client.chat.completions.create({
  model: 'llama3-70b-8192',
  messages: [{ role: 'user', content: 'Hello Groq!' }]
});
```
Streaming
```javascript
const stream = await groq.chat.completions.create({
  model: 'llama3-70b-8192',
  messages: [{ role: 'user', content: 'Write a short story about AI' }],
  stream: true
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) process.stdout.write(content);
}
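In an app you usually want both the incremental tokens (for the UI) and the final string (for storage). A small collector sketch, assuming the OpenAI-compatible delta chunk shape shown above (`collectStream` is my own helper, not an SDK function):

```javascript
// Accumulate a streamed completion into one string while forwarding
// each token to a callback, e.g. to append to a chat UI as it arrives.
async function collectStream(stream, onToken = () => {}) {
  let full = '';
  for await (const chunk of stream) {
    const token = chunk.choices?.[0]?.delta?.content;
    if (token) {
      full += token;
      onToken(token);
    }
  }
  return full;
}
```

It works with the SDK stream directly (`const text = await collectStream(stream)`) or with any async iterable of chunks.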
JSON Mode
```javascript
const result = await groq.chat.completions.create({
  model: 'llama3-70b-8192',
  messages: [{
    role: 'user',
    content: 'Extract entities from: "Apple CEO Tim Cook announced iPhone 16 in Cupertino". Return JSON with: company, person, product, location.'
  }],
  response_format: { type: 'json_object' },
  temperature: 0
});

const entities = JSON.parse(result.choices[0].message.content);
```
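JSON mode guarantees valid JSON, not your schema, so it's worth checking the keys before trusting the result. A minimal guard matching the fields requested in the prompt above (`validateEntities` is my own helper):

```javascript
// Check that a model-extracted entities object has every expected field
// as a non-empty string; report which fields are missing or malformed.
function validateEntities(obj) {
  const required = ['company', 'person', 'product', 'location'];
  const missing = required.filter(
    (k) => typeof obj[k] !== 'string' || obj[k] === ''
  );
  return { ok: missing.length === 0, missing };
}
```

On failure you can retry the request or fall back to a stricter prompt instead of crashing on `undefined` downstream.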
Speed Benchmark
Model: Llama 3 70B (OpenAI row uses GPT-4o for comparison)

| Provider | Tokens/sec | Time to first token |
|----------|-----------:|--------------------:|
| Groq     | 520        | 0.2 s |
| Together | 80         | 0.8 s |
| OpenAI (GPT-4o) | 50  | 1.2 s |
| Anyscale | 40         | 1.5 s |
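What throughput means for perceived latency follows from a simple decomposition: total wall time is time-to-first-token plus token count divided by throughput. A back-of-the-envelope helper using the figures above (the formula is the obvious model, not a published benchmark methodology):

```javascript
// Estimated wall-clock time for a completion:
// time to first token + generation time at the given throughput.
function wallTimeSec(tokens, tokensPerSec, ttftSec) {
  return ttftSec + tokens / tokensPerSec;
}

// 500-token completion: Groq ≈ 1.16 s vs OpenAI (GPT-4o) ≈ 11.2 s
const groqTime = wallTimeSec(500, 520, 0.2);
const openaiTime = wallTimeSec(500, 50, 1.2);
```

By this estimate, a 500-token answer finishes in just over a second on Groq versus more than ten seconds at 50 tokens/sec, which is the difference the intro describes.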
Use Cases
- Real-time chatbots — instant responses
- Code generation — rapid iteration
- Data extraction — parse thousands of documents fast
- Content generation — draft articles in seconds
- Agent frameworks — fast tool-calling loops
Need fast AI-powered scraping? Check out my web scraping actors on Apify — intelligent data extraction.
Need ultra-fast AI integration? Email me at spinov001@gmail.com.