Your AI portfolio has a chatbot, a sentiment analyzer, and a fine-tuned model on Hugging Face. So does every other candidate's.
Hiring managers in 2026 scan for production signals. They want to see how you handle failures, structure data, connect systems, and ship working software. A Jupyter notebook with model.predict() tells them nothing about how you build real systems.
These 5 projects are different. Each one teaches a specific production skill that AI engineering teams actually need. They're ordered from simplest to most complex. Build all 5 and you'll have a portfolio that demonstrates retrieval, structured outputs, autonomous agents, evaluation, and deployment — the exact skills showing up in job descriptions right now.
Project 1: Document Q&A With RAG
What it proves: You can connect an LLM to real data, not just training data.
RAG (Retrieval-Augmented Generation) is the most in-demand AI engineering skill in 2026. Every company building with LLMs needs someone who can ground model outputs in actual documents.
```python
# pip install langchain-chroma langchain-openai langchain-community langchain-text-splitters pypdf
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load and chunk documents
loader = PyPDFLoader("company_handbook.pdf")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = splitter.split_documents(docs)

# Store in vector database
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(),
    persist_directory="./chroma_db",
)

# Query with context
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
results = retriever.invoke("What is the PTO policy?")

llm = ChatOpenAI(model="gpt-4o-mini")
context = "\n\n".join(doc.page_content for doc in results)
answer = llm.invoke(
    f"Answer based on this context only:\n{context}\n\nQuestion: What is the PTO policy?"
)
print(answer.content)
```
What makes this portfolio-worthy: Don't stop at the basic pipeline. Add these three features that separate you from tutorials:
- Chunk quality scoring — Log which chunks the retriever returns. Are they relevant? Add a relevance filter.
- Source attribution — Show the user exactly which document page the answer came from.
- Failure handling — What happens when no relevant chunks exist? Return "I don't have enough information" instead of hallucinating.
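The relevance filter and the fallback can be sketched as a small pure-Python guard. This assumes your retriever hands back chunks with a relevance score attached (Chroma can do this via scored search); the dict shape, the `answer_context` name, and the 0.6 threshold are all illustrative, not part of any library API:

```python
FALLBACK = "I don't have enough information to answer that."

def answer_context(chunks: list[dict], threshold: float = 0.6):
    """Filter retrieved chunks by relevance score; decline when none qualify.

    Each chunk is a dict with "text", "page", and "score" keys — a stand-in
    for whatever your retriever actually returns.
    """
    relevant = [c for c in chunks if c["score"] >= threshold]
    if not relevant:
        return FALLBACK, []
    context = "\n\n".join(c["text"] for c in relevant)
    sources = [f'page {c["page"]}' for c in relevant]  # source attribution
    return context, sources

chunks = [
    {"text": "PTO is 20 days per year.", "page": 4, "score": 0.82},
    {"text": "Office dog policy.", "page": 9, "score": 0.21},
]
context, sources = answer_context(chunks)
```

The point is that the decline path is an explicit branch you can unit-test, not a behavior you hope the model exhibits.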
Hiring manager signal: "This candidate understands that RAG isn't just similarity_search(). They handle edge cases."
Project 2: Structured Data Extraction
What it proves: You can get reliable, typed outputs from LLMs — not raw text.
Every production AI system needs structured outputs. Parsing free text with regex is fragile. Modern LLMs can return validated Pydantic objects directly.
```python
# pip install langchain-openai pydantic
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

class JobPosting(BaseModel):
    """Structured representation of a job posting."""

    title: str = Field(description="Job title")
    company: str = Field(description="Company name")
    salary_min: int | None = Field(
        default=None, description="Minimum salary in USD"
    )
    salary_max: int | None = Field(
        default=None, description="Maximum salary in USD"
    )
    remote: bool = Field(description="Whether the role is remote")
    required_skills: list[str] = Field(
        description="List of required technical skills"
    )

llm = ChatOpenAI(model="gpt-4o-mini")
structured_llm = llm.with_structured_output(JobPosting)

raw_text = """
Senior ML Engineer at DataCorp. $180k-$220k.
Fully remote. Must know Python, PyTorch,
distributed training, and Kubernetes.
"""

result = structured_llm.invoke(
    f"Extract the job posting details:\n{raw_text}"
)
print(result.title)            # "Senior ML Engineer"
print(result.salary_min)       # 180000
print(result.required_skills)  # ["Python", "PyTorch", ...]
```
What makes this portfolio-worthy: Build a batch processor that extracts structured data from 100+ job postings. Show:
- Validation logic — What happens when the LLM returns `salary_min` as a string? Pydantic catches it.
- Batch efficiency — Process multiple postings concurrently with `asyncio.gather()`.
- Accuracy metrics — Compare extracted data against manually labeled samples. Report precision and recall.
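The batch side reduces to `asyncio.gather()` plus a semaphore to cap in-flight requests. A minimal sketch, where `extract_posting` is a stub standing in for an async `structured_llm.ainvoke(...)` call, and the concurrency limit of 10 is arbitrary:

```python
import asyncio

async def extract_posting(text: str, sem: asyncio.Semaphore) -> dict:
    """Stand-in for an async structured-output call; the semaphore bounds concurrency."""
    async with sem:
        await asyncio.sleep(0)  # real code: result = await structured_llm.ainvoke(...)
        return {"chars": len(text)}  # placeholder for the extracted JobPosting

async def extract_batch(texts: list[str], max_concurrent: int = 10) -> list[dict]:
    """Extract many postings concurrently, never more than max_concurrent at once."""
    sem = asyncio.Semaphore(max_concurrent)
    return await asyncio.gather(*(extract_posting(t, sem) for t in texts))

results = asyncio.run(extract_batch(["posting one", "posting two"]))
```

Without the semaphore, a 100-posting batch fires 100 simultaneous API calls and runs straight into rate limits; bounding concurrency is the production habit this bullet is about.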
Hiring manager signal: "This candidate knows that LLM outputs are unreliable by default. They validate and measure."
Project 3: Tool-Calling AI Agent
What it proves: You can build autonomous systems that take actions, not just generate text.
Agents are the fastest-growing category in AI engineering. A tool-calling agent decides which function to call based on user input — and handles the result.
```python
# pip install langgraph langchain-openai
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent

@tool
def get_weather(city: str) -> str:
    """Get current weather for a city."""
    # In production: call a real weather API
    weather_data = {
        "London": "15°C, cloudy",
        "Tokyo": "22°C, sunny",
        "New York": "8°C, rain",
    }
    return weather_data.get(city, f"No data for {city}")

@tool
def convert_temperature(celsius: float) -> str:
    """Convert Celsius to Fahrenheit."""
    fahrenheit = (celsius * 9 / 5) + 32
    return f"{celsius}°C = {fahrenheit}°F"

llm = ChatOpenAI(model="gpt-4o-mini")
tools = [get_weather, convert_temperature]
agent = create_react_agent(llm, tools)

# The agent decides which tools to call
result = agent.invoke(
    {"messages": [{"role": "user", "content": "What's the weather in Tokyo? Convert it to Fahrenheit."}]}
)
for message in result["messages"]:
    print(f"{message.type}: {message.content}")
```
What makes this portfolio-worthy: Add complexity that mirrors real production agents:
- Multi-step reasoning — Give it a task that requires 3+ tool calls in sequence.
- Error recovery — What happens when a tool call fails? Add retry logic with exponential backoff.
- Conversation memory — Use `MemorySaver` from LangGraph so the agent remembers previous interactions.
```python
from langgraph.checkpoint.memory import MemorySaver

memory = MemorySaver()
agent = create_react_agent(llm, tools, checkpointer=memory)

# First message
config = {"configurable": {"thread_id": "user-123"}}
agent.invoke(
    {"messages": [{"role": "user", "content": "What's the weather in London?"}]},
    config=config,
)

# Follow-up — agent remembers the context
agent.invoke(
    {"messages": [{"role": "user", "content": "Convert that to Fahrenheit"}]},
    config=config,
)
```
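The error-recovery bullet can start as a plain retry decorator wrapped around a flaky tool. A sketch with illustrative numbers (three attempts, delays doubling from 0.5s); in a real agent you would wrap the external API call inside the tool body:

```python
import time
from functools import wraps

def with_retries(max_attempts: int = 3, base_delay: float = 0.5):
    """Retry a flaky function with exponential backoff: base_delay, 2x, 4x, ..."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: surface the real error
                    time.sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator

calls = {"n": 0}

@with_retries(max_attempts=3, base_delay=0.01)
def flaky_weather_api() -> str:
    """Simulates an API that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return "22°C, sunny"
```

The agent then sees either a real result or a clean exception after the retries are exhausted, instead of a half-failed tool call.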
Hiring manager signal: "This candidate builds agents that recover from failures and maintain state. Not a one-shot demo."
Project 4: LLM Evaluation Pipeline
What it proves: You can measure whether your AI system actually works.
Most AI engineers ship without evaluation. The ones who get hired know how to write tests for non-deterministic systems. DeepEval integrates with pytest to make this practical.
```python
# pip install deepeval
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    AnswerRelevancyMetric,
    HallucinationMetric,
)

def test_rag_answer_relevancy():
    """Test that RAG answers are relevant to the question."""
    test_case = LLMTestCase(
        input="What is the refund policy?",
        actual_output="Our refund policy allows returns within 30 days of purchase with a valid receipt.",
        retrieval_context=[
            "Refund Policy: Customers may return items within 30 days of purchase. A valid receipt is required.",
        ],
    )
    metric = AnswerRelevancyMetric(threshold=0.7)
    assert_test(test_case, [metric])

def test_no_hallucination():
    """Test that the model doesn't hallucinate beyond the context."""
    test_case = LLMTestCase(
        input="What is the refund policy?",
        actual_output="Our refund policy allows returns within 30 days. We also offer free shipping on all orders.",
        # HallucinationMetric evaluates against context, not retrieval_context
        context=[
            "Refund Policy: Customers may return items within 30 days of purchase.",
        ],
    )
    metric = HallucinationMetric(threshold=0.5)
    # This should FAIL — "free shipping" is hallucinated
    assert_test(test_case, [metric])
```
Run it like any other test:

```shell
deepeval test run test_evaluation.py
```
What makes this portfolio-worthy:
- Test your own Project 1 — Write evaluation tests for your RAG pipeline. Measure hallucination rates across 50+ test cases.
- CI integration — Add DeepEval to a GitHub Actions workflow. Block merges when hallucination rate exceeds your threshold.
- Regression tracking — Show how your RAG pipeline improved over time. Before: 23% hallucination rate. After tuning chunk size and retriever: 4%.
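The CI gate in the second bullet is just arithmetic over per-case results. A sketch, assuming you collect one hallucinated-or-not flag per test case (how you get those flags from DeepEval's output is up to you; the 5% threshold is a placeholder):

```python
def hallucination_rate(flags: list[bool]) -> float:
    """Fraction of test cases flagged as hallucinated."""
    return sum(flags) / len(flags) if flags else 0.0

def ci_gate(flags: list[bool], max_rate: float = 0.05) -> bool:
    """Return True (merge allowed) only when the rate stays at or under the threshold."""
    return hallucination_rate(flags) <= max_rate

# 2 hallucinations out of 50 cases -> 4% rate, passes a 5% gate
flags = [True, True] + [False] * 48
assert ci_gate(flags, max_rate=0.05)
```

In a GitHub Actions workflow, exit non-zero when `ci_gate` returns False and the merge is blocked; the before/after rates in the regression-tracking bullet come from running the same suite against each version of the pipeline.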
Hiring manager signal: "This candidate doesn't just build AI — they prove it works. They think about reliability."
Project 5: Deploy Everything Behind an API
What it proves: You can ship AI as a service, not a notebook.
The gap between "works on my laptop" and "deployed and callable" is where most candidates stop. Bridge it.
```python
# pip install fastapi uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="AI Portfolio API")

class QuestionRequest(BaseModel):
    question: str
    document_id: str | None = None

class AnswerResponse(BaseModel):
    answer: str
    sources: list[str]
    confidence: float

@app.post("/ask", response_model=AnswerResponse)
async def ask_question(request: QuestionRequest):
    try:
        # Connect to your RAG pipeline from Project 1
        results = retriever.invoke(request.question)
        if not results:
            raise HTTPException(
                status_code=404,
                detail="No relevant documents found",
            )
        context = "\n\n".join(doc.page_content for doc in results)
        answer = llm.invoke(
            f"Answer based on this context only:\n{context}\n\nQuestion: {request.question}"
        )
        return AnswerResponse(
            answer=answer.content,
            sources=[doc.metadata.get("source", "unknown") for doc in results],
            confidence=0.85,
        )
    except HTTPException:
        raise  # don't rewrap the deliberate 404 as a 500
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health_check():
    return {"status": "healthy", "model": "gpt-4o-mini"}
```
What makes this portfolio-worthy:
- Dockerfile — Containerize the entire stack. One `docker compose up` to run everything.
- Rate limiting — Add `slowapi` to prevent abuse. Show you think about production concerns.
- Monitoring — Add a `/metrics` endpoint that tracks request count, latency, and error rate.
```dockerfile
FROM python:3.12-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```
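The monitoring bullet can start as a tiny in-process tracker that a hypothetical `/metrics` endpoint serializes. This sketch has no external dependencies (for real deployments you would swap in a Prometheus client library); the class name and field layout are illustrative:

```python
import time

class Metrics:
    """In-process counters for request count, error count, and latency."""

    def __init__(self):
        self.requests = 0
        self.errors = 0
        self.latencies: list[float] = []

    def record(self, started: float, error: bool = False) -> None:
        """Call once per request, with the time.perf_counter() value taken at its start."""
        self.requests += 1
        if error:
            self.errors += 1
        self.latencies.append(time.perf_counter() - started)

    def snapshot(self) -> dict:
        """What the /metrics endpoint would return."""
        avg = sum(self.latencies) / len(self.latencies) if self.latencies else 0.0
        return {
            "requests": self.requests,
            "error_rate": self.errors / self.requests if self.requests else 0.0,
            "avg_latency_s": round(avg, 4),
        }

metrics = Metrics()
start = time.perf_counter()
metrics.record(start)  # in the API: call from a FastAPI middleware or inside /ask
```

Wiring it up is one FastAPI middleware that stamps the start time and calls `record()` in a `finally` block, plus a `GET /metrics` route returning `metrics.snapshot()`.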
Hiring manager signal: "This candidate ships. They don't just prototype — they deploy."
How to Present These Projects
A GitHub repo with code is necessary but not sufficient. Here's what separates hired candidates from ignored ones:
Write a README that answers three questions:
- What does this project do? (One sentence.)
- How do I run it? (`pip install` + one command.)
- What did you learn? (The interesting engineering decisions you made.)
Include a DECISIONS.md file. Document why you chose ChromaDB over Pinecone. Why you used gpt-4o-mini instead of gpt-4o. Why your chunk size is 1,000 tokens. Hiring managers want to see your reasoning, not just your code.
Record a 2-minute Loom video. Walk through the project running. Show the terminal output. Explain one interesting failure you encountered and how you fixed it. This alone puts you ahead of 90% of applicants.
Link your projects together. Project 5 wraps Project 1. Project 4 tests Project 1. Project 3 uses the same patterns as Project 2. A portfolio with connected projects shows systems thinking.
The Bottom Line
The AI job market in 2026 rewards builders over learners. Hiring managers see hundreds of "completed the LLM course" portfolios every week. They hire the candidates who show production instincts: error handling, evaluation, deployment, and structured thinking.
Build these 5 projects. Connect them. Deploy them. Document your decisions. That's a portfolio that gets callbacks.
Follow @klement_gunndu for more AI engineering content. We're building in public.