Part 3: AI Integration & Best Practices
Using AI for Data Engineering
AI tools help data engineers by:
- Generating workflows faster - Describe tasks in natural language
- Avoiding errors - Get syntax-correct code following best practices
Key Insight: AI is only as good as the context you provide.
Context Engineering with LLMs
Problem: Generic AI assistants (like ChatGPT without context) may produce:
- Outdated plugin syntax
- Incorrect property names
- Hallucinated features that don't exist
Why? LLMs are trained on data up to a knowledge cutoff date, so they know nothing about plugins, properties, or syntax changes released after it.
Solution: Provide proper context to AI!
Kestra AI Copilot
Kestra's built-in AI Copilot is designed specifically for generating Kestra flows with:
- Full context about latest plugins
- Correct workflow syntax
- Current best practices
Setup Requirements:
- Get Gemini API key from Google AI Studio
- Configure in docker-compose.yml with GEMINI_API_KEY
- Access via sparkle icon (✨) in Kestra UI
Retrieval Augmented Generation (RAG)
RAG is a technique that:
- Retrieves relevant information from data sources
- Augments the AI prompt with this context
- Generates responses grounded in real data
RAG Process in Kestra:
- Ingest documents (documentation, release notes)
- Create embeddings (vector representations)
- Store embeddings in KV Store or vector database
- Query with context at runtime
- Generate accurate, context-aware responses
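The retrieve and augment steps above can be sketched in a few lines of Python. This is a toy illustration, not how Kestra implements RAG: the "embedding" here is a plain word-count vector standing in for a real embedding model, the three sample documents are invented, and every function name (`embed`, `build_index`, `retrieve`, `build_prompt`) is hypothetical. The final generate step, which would send the prompt to an LLM, is left out.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents):
    """Ingest step: store each document next to its embedding."""
    return [(doc, embed(doc)) for doc in documents]


def retrieve(index, question, k=1):
    """Retrieve step: return the k documents most similar to the question."""
    query_vec = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_prompt(question, context_docs):
    """Augment step: prepend the retrieved context to the question."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Invented sample documents, for illustration only.
docs = [
    "Release notes: the scheduler trigger now supports backfills.",
    "Guide: store API keys as secrets, never in Git.",
    "Tutorial: loading CSV files into a warehouse table.",
]

index = build_index(docs)
question = "Where should API keys be stored?"
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In a real pipeline the index would live in a vector database and `embed` would call an embedding model, but the flow of ingest, retrieve, augment, then generate stays the same.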
RAG Best Practices:
- Keep documents updated regularly
- Chunk large documents appropriately
- Test retrieval quality
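As one possible reading of "chunk appropriately", the sketch below splits text into fixed-size character windows that overlap, so a sentence cut at a boundary still appears whole in at least one chunk. The function name `chunk_text` and the default sizes are assumptions chosen for illustration; real pipelines often chunk by tokens, sentences, or headings instead.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into character windows of chunk_size that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
chunks = chunk_text(sample, chunk_size=12, overlap=4)
for chunk in chunks:
    print(repr(chunk))
```

Larger overlaps make retrieval more forgiving at chunk boundaries but store more duplicated text, so the right values are worth checking against retrieval quality on your own documents.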
Deployment & Production
For production deployment:
- Deploy Kestra on Google Cloud
- Sync workflows from Git repository
- Use Secrets and KV Store for sensitive data
- Never commit API keys to Git
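The "never commit API keys" rule applies equally to any helper script you keep alongside your flows. A minimal sketch, assuming the key is supplied through an environment variable named GEMINI_API_KEY (the variable mentioned in the setup above); the helpers `load_api_key` and `mask` are hypothetical names, not part of Kestra or any Google library.

```python
import os


def load_api_key(name="GEMINI_API_KEY", env=None):
    """Read a secret from the environment instead of hardcoding it in source."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it or add it to an untracked .env file")
    return value


def mask(secret, visible=4):
    """Hide all but the last few characters so the key is safe to log."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


if __name__ == "__main__":
    try:
        print("Loaded key:", mask(load_api_key()))
    except RuntimeError as err:
        print("Configuration error:", err)
```

Failing fast when the variable is missing surfaces a misconfigured deployment immediately, and masking keeps the key out of logs that may end up shared or committed.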
Troubleshooting Tips
| Issue | Solution |
|---|---|
| Port conflict with pgAdmin | Change Kestra port to 18080 |
| CSV column mismatch in BigQuery | Rerun entire execution including re-download |
| Container issues | Stop, remove, and restart containers |
Recommended Docker Images:
- kestra/kestra:v1.1 (stable version)
- postgres:18
Additional Resources
- Kestra Documentation
- Blueprints Library - Pre-built workflow examples
- 600+ Plugins
- Kestra Slack Community
Key Takeaways
- Workflow orchestration is essential for managing complex data pipelines
- Kestra provides a flexible, scalable solution with YAML-based flows
- ETL is ideal for local processing; ELT leverages cloud computing power
- Scheduling and backfills enable automated and historical data processing
- AI Copilot accelerates workflow development with proper context
- RAG reduces AI hallucinations by grounding responses in real data

#dezoomcamp