A practical demonstration of how to leverage Amazon S3 Vectors (a vector store) and Cohere embeddings to give data teams contextual, historical fixes directly within failed Airflow task logs
As data engineers, we spend too much time hunting through logs when a workflow fails, especially when you're the newest member of the team. What if orchestration were not only about automation, but also about learning from failures and guiding your team through them?
In this article, I'll show you how to build an intelligent Airflow DAG that uses Amazon S3 Vectors, embeddings, and vector search to capture historical failure wisdom and surface actionable fix hints, cutting down on manual log debugging.
Stack
- Airflow
- S3 Vectors
- Amazon Bedrock
- cohere.embed-v4:0
What Problem Are We Solving?
Everybody knows that Airflow workflows inevitably fail at some point, for reasons like data quality issues, dependency conflicts, permission errors, timeouts, whatever. Traditionally, digging through the logs is an engineer's first step, but what if there were a better option?
This project uses Cohere embeddings (via Amazon Bedrock) and S3 Vectors to index past errors and search for similar failure patterns. That means that once a specific task fails, we take the following steps (sketched in code right after this list):
- Capture the error
- Create an error summary dict
- Generate a semantic vector embedding
- Query a vector index stored in S3 vectors
- Retrieve and suggest the most relevant solution
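To make those steps concrete, here is a minimal sketch of the embed-and-query core. The region, bucket, and index names are placeholders of mine, not from the repo, and since both S3 Vectors and Cohere Embed v4 are recent, double-check the exact request/response fields against the AWS docs:

```python
import json
import boto3

REGION = "us-east-1"                       # placeholder
VECTOR_BUCKET = "airflow-error-knowledge"  # placeholder vector bucket name
VECTOR_INDEX = "airflow-errors"            # placeholder vector index name
MODEL_ID = "cohere.embed-v4:0"

bedrock = boto3.client("bedrock-runtime", region_name=REGION)
s3vectors = boto3.client("s3vectors", region_name=REGION)


def generate_embedding(error_text: str) -> list:
    """Embed the error summary with Cohere on Bedrock."""
    # Request/response shape follows the Cohere embed models on Bedrock;
    # verify the exact schema for embed-v4 in the AWS documentation.
    body = json.dumps({"texts": [error_text], "input_type": "search_query"})
    response = bedrock.invoke_model(modelId=MODEL_ID, body=body)
    payload = json.loads(response["body"].read())
    return payload["embeddings"][0]


def find_similar_error(error_text: str):
    """Query the S3 Vectors index for the closest historical error."""
    result = s3vectors.query_vectors(
        vectorBucketName=VECTOR_BUCKET,
        indexName=VECTOR_INDEX,
        queryVector={"float32": generate_embedding(error_text)},
        topK=1,
        returnDistance=True,
        returnMetadata=True,
    )
    matches = result.get("vectors", [])
    if not matches:
        return None
    best = matches[0]
    return {
        "suggestion": best["metadata"].get("solution"),
        "similarity_score": round(best["distance"], 4),  # lower distance = closer match
    }
```

Note that the score in the Airflow log example further down (0.0162) is consistent with a distance metric, where values closer to zero mean a closer match.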
How it Works
At a high level, I built a simulated DAG to demonstrate the idea, generating common errors on purpose so you can clearly see the value. In a real project you will face your own unique, varied, and countless failures; here, the DAG simulates failures for these common problem types (see the sketch after the list):
- Division by Zero
- Data validation failures (syntax)
- S3 permissions
- Database connections issues (Timeout)
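As an illustration, here is one way to simulate those failures; the scenarios below are hypothetical stand-ins for the repo's actual ones, and `simulate_failure` also pushes an error summary to XCom before re-raising so a downstream task can read it:

```python
import random
from airflow.decorators import task
from airflow.operators.python import get_current_context

# Hypothetical examples of the four problem types listed above.
ERROR_SCENARIOS = [
    ZeroDivisionError("division by zero in metric computation"),
    ValueError("Data validation failed: malformed JSON in column 'payload'"),
    PermissionError("AccessDenied when calling s3:GetObject on raw-data bucket"),
    TimeoutError("Database connection timed out after 30s"),
]


@task
def simulate_failure():
    """Fail on purpose with one of the known error types."""
    try:
        raise random.choice(ERROR_SCENARIOS)
    except Exception as exc:
        # XComs written before the exception are persisted, so the hint
        # task can still retrieve the summary after this task fails.
        get_current_context()["ti"].xcom_push(key="error_summary", value=repr(exc))
        raise  # let the task fail as it normally would
```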
That being said, when a task fails, its error message is captured and embedded using an embedding model hosted on Amazon Bedrock. That embedding is used to query a vector index in S3 Vectors, which stores previously seen errors and their solutions (read the NOTE below); S3 Vectors lets you perform similarity search directly on S3 without managing a separate vector database. Finally, a task called `hint_to_solve` returns the closest match and suggests the corresponding solution right in the Airflow logs.
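A minimal sketch of that wiring might look like this, reusing `simulate_failure` and `find_similar_error` from the earlier snippets (the DAG id and schedule are placeholders; the real DAG in the repo is more complete):

```python
import json
import logging
from datetime import datetime

from airflow.decorators import dag, task
from airflow.operators.python import get_current_context
from airflow.utils.trigger_rule import TriggerRule


@task(trigger_rule=TriggerRule.ONE_FAILED)
def hint_to_solve():
    """Runs only when an upstream task has failed: pulls the captured
    error, queries the S3 Vectors index, and logs the suggested fix."""
    ti = get_current_context()["ti"]
    error_text = ti.xcom_pull(task_ids="simulate_failure", key="error_summary")
    hint = find_similar_error(error_text) or {"suggestion": "No similar error found"}
    logging.info("💡 How to Solve this error: 👇🏻")
    logging.info(json.dumps(hint, indent=2))


@dag(schedule=None, start_date=datetime(2025, 1, 1), catchup=False)
def smart_airflow_dag():
    # simulate_failure comes from the previous sketch.
    simulate_failure() >> hint_to_solve()


smart_airflow_dag()
```

The `ONE_FAILED` trigger rule is what lets the hint task run precisely when its upstream task fails, instead of being skipped.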
NOTE: Data ingestion into the S3 Vector Index is out of scope for this article, as it is straightforward and well covered in the 👉🏻 AWS documentation.
For reference, the simulated error records ingested into the index are available here:
https://github.com/alexbonella/Airflow-S3-Vector-Guide/blob/main/airflow_simulation_error.json
Here's what the hints DAG looks like in action 👇🏻
[2025-12-16, 19:02:53 UTC] {local_task_job_runner.py:123} ▶ Pre task execution logs
[2025-12-16, 19:02:54 UTC] {smart_airflow_dag.py:206} INFO - INFO: Generating embedding for error: RuntimeError: Dependency 'requests' too old: 2.25.0 < 2.32.0 (at smart_airflow_dag.py, line 163)
[2025-12-16, 19:02:56 UTC] {smart_airflow_dag.py:71} INFO - ✅ Embedding successfully generated. Dimension: 1536
[2025-12-16, 19:02:56 UTC] {smart_airflow_dag.py:215} INFO - ⏳: Querying vector database for similar errors...
[2025-12-16, 19:02:57 UTC] {smart_airflow_dag.py:241} INFO - 💡 How to Solve this error: 👇🏻
[2025-12-16, 19:02:57 UTC] {smart_airflow_dag.py:242} INFO - {
"suggestion": "Update the 'requests' package in the `requirements.txt` file to a version greater than or equal to 2.32.0 and redeploy the environment.",
"similarity_score": 0.0162
}
[2025-12-16, 19:02:57 UTC] {smart_airflow_dag.py:243} INFO -
[2025-12-16, 19:02:57 UTC] {python.py:240} INFO - Done. Returned value was: None
[2025-12-16, 19:02:57 UTC] {taskinstance.py:349} ▶ Post task execution logs
Why This Matters
Traditional Airflow error handling is reactive and manual. With semantic search over historical errors:
- Your data team saves time on debugging, especially its newest members
- Organizational knowledge about errors is codified and reusable
- Workflows become self-aware and proactive instead of running in orchestration zombie mode
Turning Failures into Knowledge
This guide demonstrates a real-world use case for embedding searchable failure knowledge directly into Airflow. If you're leading or scaling a data team, imagine the impact of pipelines (DAGs) that don't just report errors, but guide you toward solving them.
Feel free to check out the complete code and adapt it to your environment or models!
👉 GitHub: https://github.com/alexbonella/Airflow-S3-Vector-Guide
