What is Groq?
Groq is an AI inference company that built custom hardware — the LPU (Language Processing Unit) — specifically for running LLMs. The result: 500+ tokens/second of output, roughly 10-18x faster than GPT-4 served through OpenAI's API. And they offer a generous free tier.
Why Groq is a Game-Changer
- Free tier — generous rate limits for development
- 500+ tokens/sec — responses feel instant (GPT-4 does ~30 tokens/sec)
- OpenAI-compatible API — drop-in replacement
- Llama 3, Mixtral, Gemma — all major open-source models
- Custom LPU hardware — not GPUs, purpose-built for inference
Quick Start
pip install groq

from groq import Groq

client = Groq(api_key="your-api-key")  # Free at console.groq.com

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain microservices vs monolith in 3 sentences"}],
    temperature=0.7,
)

print(response.choices[0].message.content)
# Response arrives in under a second for short prompts
OpenAI Drop-In Replacement
from openai import OpenAI

# Change ONE line — all your existing OpenAI code works
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key="your-groq-key",
)

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Write a Python async web scraper"}],
)
Streaming (Real-Time Output)
stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Build a complete FastAPI CRUD app with SQLAlchemy"}],
    stream=True,
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)

# Full code output in 2-3 seconds instead of 30-60 with GPT-4
Tool Use / Function Calling
tools = [{
    "type": "function",
    "function": {
        "name": "search_database",
        "description": "Search for products in the database",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "max_results": {"type": "integer", "default": 10}
            },
            "required": ["query"]
        }
    }
}]

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Find all red shoes under $50"}],
    tools=tools,
)
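When the model decides to use the tool, the reply carries `response.choices[0].message.tool_calls` (each with a function name and a JSON string of arguments) instead of plain text, and your code runs the matching function. A minimal dispatch sketch — `search_database` here is a hypothetical local implementation standing in for your real lookup:

```python
import json

def search_database(query, max_results=10):
    # Hypothetical local implementation — replace with your real product lookup.
    catalog = [
        {"name": "red running shoes", "price": 45},
        {"name": "red sandals", "price": 30},
        {"name": "blue sneakers", "price": 60},
    ]
    hits = [p for p in catalog if all(w in p["name"] for w in query.split())]
    return hits[:max_results]

def dispatch_tool_call(name, arguments_json):
    # Route the model's tool call to the matching Python function.
    args = json.loads(arguments_json)
    if name == "search_database":
        return search_database(**args)
    raise ValueError(f"unknown tool: {name}")

# Arguments arrive from the API as a JSON string, exactly like this:
result = dispatch_tool_call("search_database", '{"query": "red shoes", "max_results": 5}')
```

In a real loop you would append the result back to the conversation as a `"tool"` role message and call the API again so the model can phrase the final answer.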
JSON Mode (Structured Output)
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{
        # JSON mode requires the prompt itself to ask for JSON
        "role": "user",
        "content": "Extract entities as JSON from: Apple released iPhone 16 in September 2024 for $799"
    }],
    response_format={"type": "json_object"},
)
# Returns valid JSON, e.g.: {"company": "Apple", "product": "iPhone 16", "date": "September 2024", "price": 799}
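Because JSON mode guarantees syntactically valid JSON, the content can go straight into `json.loads` with no regex cleanup. A minimal sketch, using the example output above as a stand-in for `response.choices[0].message.content` (the exact field names depend on what the model chooses):

```python
import json

# Stand-in for response.choices[0].message.content from the call above.
raw = '{"company": "Apple", "product": "iPhone 16", "date": "September 2024", "price": 799}'

entities = json.loads(raw)  # Never raises on JSON-mode output: it is always valid JSON
print(entities["product"], entities["price"])  # iPhone 16 799
```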
Speed Comparison
| Provider | Model | Output Speed | Latency |
|---|---|---|---|
| Groq | Llama 3 70B | 500+ tok/s | <0.5s |
| Together AI | Llama 3 70B | 80 tok/s | ~1s |
| OpenAI | GPT-4 Turbo | 30 tok/s | ~2s |
| Anthropic | Claude 3 | 40 tok/s | ~1.5s |
| Fireworks | Llama 3 70B | 100 tok/s | ~0.8s |
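These figures drift as providers swap hardware and models, so it is worth measuring throughput yourself. A minimal sketch: time a live request and divide the completion token count (from `response.usage.completion_tokens`) by the elapsed wall-clock time — the live call is left commented out here, with placeholder numbers in its place:

```python
import time

def tokens_per_second(completion_tokens, elapsed_seconds):
    # Throughput = tokens generated / wall-clock generation time.
    return completion_tokens / elapsed_seconds

# With a live client, you would measure like this:
# start = time.perf_counter()
# response = client.chat.completions.create(model="llama-3.3-70b-versatile", messages=[...])
# elapsed = time.perf_counter() - start
# rate = tokens_per_second(response.usage.completion_tokens, elapsed)

# Placeholder numbers to show the arithmetic:
rate = tokens_per_second(1024, 2.0)
print(f"{rate:.0f} tok/s")  # 512 tok/s
```

Note that time-to-first-token (the latency column) and sustained tokens/sec are different numbers; streaming lets you measure both.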
Real-World Use Case
A real-time coding assistant needed sub-second response times. OpenAI took 5-10 seconds per code completion, and developers lost their flow. After switching to Groq with Llama 3 70B, completions arrived in 0.5-1 seconds. Developer productivity went up 40% because developers stopped context-switching while waiting for the AI.
Building real-time AI applications? I help teams optimize inference pipelines for speed and cost. Contact spinov001@gmail.com or explore my automation tools on Apify.