Aniket Hingane

Posted on Jun 30

Building an Intelligent Fleet Maintenance Compliance Coordinator with LangGraph

#python #langgraph #ai #fleet

How I Automated DOT Inspection Triage Using Multi-Agent Business Rules

Part 1 — Context and Foundations

TL;DR

I built FleetGuard-AI as a personal proof-of-concept to explore a problem I kept noticing while experimenting with fleet operations patterns. Maintenance and compliance data is easy to collect from telematics, ELD exports, and shop management systems. The hard part is deciding what each vehicle record means when DOT inspection windows, brake wear curves, tire tread rules, PM mileage intervals, and open defect reports all interact at dispatch time. This experimental project routes fleet maintenance records through a LangGraph-orchestrated multi-agent pipeline backed by FastAPI and a lightweight FMCSA-style policy knowledge index. Each record passes through DOT inspection validation, brake and tire checks, preventive maintenance gates, ELD hour limits, defect documentation review, compliance retrieval, and action generation before landing in a structured report with urgency levels and agent traces. The full source code lives at https://github.com/aniket-work/FleetGuard-AI. In my experience, the most interesting takeaway is not any single agent prompt, but how graph-based orchestration keeps specialist compliance steps composable without turning business logic into spaghetti. This write-up documents my design choices, the code I wrote along the way, and what I would change if I extended the PoC further.

Introduction

A few weeks ago I started paying closer attention to what happens after a morning dispatch review surfaces a Class 8 tractor with twelve percent brake pad life remaining, a steer-axle tire at three and a half thirty-seconds tread depth, and a DOT inspection that is three hundred forty days old. From my observation reading fleet compliance forums and anonymized maintenance exports, the numbers themselves are rarely the whole story. A dispatcher still has to answer harder questions. Is this vehicle legally out of service or merely approaching a warning threshold? Does the steer-axle note change tire minimums? Which FMCSA policy clause applies, and how quickly should someone pull the unit before a roadside inspection turns into a citation?

I wanted to test whether multi-agent orchestration could compress that triage work without pretending to replace human judgment. FleetGuard-AI is the result of that experiment. It is not production software, not something I deployed at any employer, and not a claim about how commercial fleets should operate. It is a solo PoC I built to learn LangGraph patterns in a domain that felt concrete and underserved in agent tutorials.

The architecture rhymes with other validation experiments I have tried: specialist agents, a coordinator graph, a knowledge layer for policy retrieval, and a thin API for integration. I deliberately chose fleet maintenance compliance rather than chatbots or document summarization because those demos are overrepresented, and because roadside-ready violations have crisp numeric inputs and operational outputs that make evaluation honest.

What pulled me toward this domain specifically was the asymmetry between measurement and response. Modern fleets already emit odometer readings, brake wear percentages, tire tread measurements, ELD driving hours, and open DVIR defect counts from shop systems and telematics gateways. The gap I noticed was structured validation: turning raw maintenance records into urgency labels, compliance violation tags, policy citations, and recommended actions that a dispatcher could accept or override quickly. I wondered whether explicit agent roles could mirror that mental sorting without hiding reasoning inside a black-box model.

I also wanted a use case where mistakes are obvious. If an agent labels a tractor with eight percent brake pads as cleared for dispatch, tests should scream. That clarity helped me iterate faster than I would have in a vague business insights demo where quality is subjective.

From where I stand, fleet compliance validation shares a structural pattern with other enterprise-style agent experiments I have built: you receive structured events, apply domain-specific business rules in layers, attach policy citations, and emit operational recommendations with explicit urgency. I did not copy any prior tutorial wholesale. Instead, I asked what "validation" means when the asset is a regulated commercial vehicle rather than a shopping cart, then designed agents around that question.

What's This Article About?

This article walks through how I designed and implemented FleetGuard-AI, a multi-agent system that ingests batches of fleet maintenance records and emits compliance reports containing violation labels, urgency scores, FMCSA-style policy references, recommended interventions, and response hour estimates. You will see the LangGraph state machine I used, the specialist agents I wrote, the policy knowledge module that acts as a stand-in for retrieval-augmented generation, and the FastAPI surface that exposes the workflow to a simple dashboard.

I also cover setup and execution steps so you can reproduce the PoC locally, and I share edge cases I discovered while testing, including steer-axle tire thresholds that only tighten when dispatcher notes mention steer axle wear. Throughout, I frame conclusions as personal observations from an experimental build rather than authoritative guidance for regulated fleet environments.

Tech Stack

I kept the stack intentionally small. Complexity in agent systems often hides in dependencies long before it hides in prompts, and I wanted to feel every layer.

Python 3.11 as the runtime. Strong typing with Pydantic models made maintenance record batches easy to validate at API boundaries.
LangGraph for orchestration. I considered a hand-rolled coordinator loop, but LangGraph gave me explicit state transitions and a graph I could diagram without lying.
FastAPI for HTTP endpoints and to serve a minimal HTML dashboard from the same process.
Pydantic v2 for schemas such as MaintenanceRecord, ComplianceResult, and BatchComplianceReport.
Rich for CLI tables. Terminal output matters in fleet operations workflows because many dispatch tools still live partly in terminals and JSON exports.
Matplotlib for urgency and violation distribution charts. I wanted at least one visual summary of learned compliance statistics, not because the charts are fancy, but because they make batch behavior obvious at a glance.

I skipped a full vector database in this PoC. A lexical policy index was enough to prove the RAG insertion point, and it kept the repository approachable for readers who want to clone and run in ten minutes.

In my experience shipping small services, I optimize for clone-to-green time in PoCs. Every extra infrastructure component costs another reader who bounces at step four of the README. LangGraph plus FastAPI plus a JSON fixture gets people to a colored terminal table quickly. That table is the emotional payoff that convinces someone to read the rest of the design sections.

Component	Role in FleetGuard-AI	Why I chose it
LangGraph	Per-record compliance loop	Makes orchestration visible and testable
FastAPI	HTTP plus static dashboard	Single process dev experience
Pydantic	Schema guardrails	Prevents malformed record batches early
Rich	CLI tables	Matches dispatcher-friendly output
Matplotlib	Urgency and violation charts	Quant summary for README and article

Part 2 — Design Rationale

Why Read It?

If you are experimenting with LangGraph and tired of toy chatbots, fleet maintenance compliance offers a disciplined sandbox. Inputs are structured maintenance records with vehicle class metadata. Outputs are enums, policy codes, and action lists. That makes it easy to tell when an agent misbehaved without hand-waving about subjective quality.

From my perspective, three ideas in this PoC transfer to other domains:

Specialist agents beat monolithic prompts when each step has different failure modes. DOT inspection currency and brake wear scoring should not share one prompt context window.
Graph orchestration documents operational process better than nested if-statements. Dispatchers already think in workflows; LangGraph mirrors that honestly.
A thin API unlocks multiple clients with one workflow. I used both a Rich CLI and a browser dashboard against the same compliance function.

The GitHub repository at https://github.com/aniket-work/FleetGuard-AI includes tests, sample data, diagrams, and a title animation GIF generated from terminal output plus a dashboard preview. Clone it, run python main.py, and you will see the same ASCII table aesthetic I embedded in the article cover animation.

Let's Design

Before writing code I sketched the lifecycle of a single maintenance record. A fleet system reports odometer miles, days since last DOT inspection, miles since last PM service, brake pad percent remaining, tire tread depth in thirty-seconds of an inch, seven-day ELD driving hours, open defect counts, maintenance log completeness, and dispatcher notes. A compliance system must answer six questions: is the DOT annual inspection current, are brakes and tires within vehicle-class thresholds, is preventive maintenance overdue, are ELD hours within federal limits, are open defects and documentation gaps blocking dispatch, and what intervention urgency is appropriate. The first five questions map to five detection agents. Urgency, policy retrieval, and action generation follow as downstream steps, and a coordinator sequences all of them.

Coordinator and state

The coordinator owns batch progress. LangGraph state carries the record list, accumulated results, fleet name, and a numeric index. I chose explicit index-based iteration rather than recursive graph tricks because it made debugging easier when a single record produced surprising output.

DOT inspection agent

The DOT inspection agent checks whether the annual inspection is overdue beyond three hundred sixty-five days or has entered the warning window that opens after three hundred thirty days. Vehicles past the hard limit flag dot_inspection_overdue. Vehicles in the warning band flag dot_inspection_due_soon. I put inspection currency first because, in my opinion, operating an out-of-compliance vehicle is worse than delaying a load by a few hours.

Brake and tire agent

This agent encodes the mechanical intuition that actually grounds roadside inspections. Class 7 and Class 8 tractors carry different brake pad critical and warning thresholds. Tire tread minimums tighten when dispatcher notes mention steer axle wear, because steer tires face stricter federal minimums than drive tires. Encoding these as lookup tables made pytest assertions straightforward.

PM service agent

Preventive maintenance intervals differ by vehicle class. Refrigerated units use a fifteen thousand mile PM window instead of the twenty five thousand mile default for linehaul tractors. When miles since last PM exceed the limit, the agent flags pm_service_overdue. When the vehicle is within three thousand miles of the limit, it flags pm_service_due_soon.
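
The interval rule above can be sketched in a few lines. This is a minimal illustration, not the repository's code: the PmInput dataclass, the lowercase class strings, and the choice to treat a reading exactly at the limit as "due soon" are assumptions made for the example.

```python
from dataclasses import dataclass

# Intervals from the text: 15,000 miles for refrigerated units, 25,000 otherwise.
PM_INTERVAL_MILES = {"refrigerated": 15_000}
PM_DEFAULT_INTERVAL_MILES = 25_000
PM_DUE_SOON_WINDOW_MILES = 3_000


@dataclass
class PmInput:
    """Hypothetical stand-in for the two record fields this check reads."""
    vehicle_class: str
    last_pm_service_miles_ago: int


def pm_service_check(record: PmInput) -> list[str]:
    """Return PM violation labels for one record."""
    limit = PM_INTERVAL_MILES.get(record.vehicle_class, PM_DEFAULT_INTERVAL_MILES)
    miles = record.last_pm_service_miles_ago
    if miles > limit:
        return ["pm_service_overdue"]
    # Assumption: "within three thousand miles" includes the limit itself.
    if miles >= limit - PM_DUE_SOON_WINDOW_MILES:
        return ["pm_service_due_soon"]
    return []
```

The same sixteen thousand miles since service is overdue for a refrigerated unit and unremarkable for a linehaul tractor, which is the point of keying the interval by vehicle class.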

ELD hours agent

This agent encodes a regulatory gate I rarely see in generic fleet demos: the sixty-hour cap over a seven-day window. The federal rule counts on-duty time, so applying the cap to ELD driving hours is a simplification in this PoC. Above sixty triggers eld_hours_exceeded. Between fifty-five and sixty the agent logs an approaching-limit trace without flagging a violation yet.
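
A minimal sketch of that gate, assuming the agent takes the seven-day hour total and a shared trace list. The function name and the trace wording are illustrative and may differ from the repository.

```python
ELD_LIMIT_HOURS_7DAY = 60.0
ELD_APPROACHING_HOURS_7DAY = 55.0


def eld_hours_check(driving_hours_7day: float, trace: list[str]) -> list[str]:
    """Flag a violation above 60 hours; only log a trace between 55 and 60."""
    violations: list[str] = []
    if driving_hours_7day > ELD_LIMIT_HOURS_7DAY:
        violations.append("eld_hours_exceeded")
    elif driving_hours_7day >= ELD_APPROACHING_HOURS_7DAY:
        # Approaching the limit is recorded for the audit trail, not flagged.
        trace.append(f"EldHoursAgent: approaching limit at {driving_hours_7day}h")
    return violations
```

Keeping the approaching-limit case in the trace rather than in the violation list means it never influences urgency, yet it still shows up when someone audits the record.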

Defect and documentation agent

Open DVIR defects, missing maintenance logs, and odometer discrepancies flagged in dispatcher notes each produce distinct violation types. A note mentioning "odometer discrepancy" or "mileage gap" triggers odometer_gap even when numeric fields look otherwise normal.
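
The three checks described above are independent of each other, which a short sketch makes visible. The function name and argument list are assumptions; the keyword phrases and violation labels come from the text.

```python
# Phrases from the text that signal an odometer problem in dispatcher notes.
ODOMETER_KEYWORDS = ("odometer discrepancy", "mileage gap")


def defect_documentation_check(
    open_defect_count: int, has_maintenance_log: bool, dispatcher_notes: str
) -> list[str]:
    """Return one label per problem found; the checks do not mask each other."""
    violations: list[str] = []
    if open_defect_count > 0:
        violations.append("defect_report_open")
    if not has_maintenance_log:
        violations.append("documentation_missing")
    notes = dispatcher_notes.lower()
    if any(keyword in notes for keyword in ODOMETER_KEYWORDS):
        violations.append("odometer_gap")
    return violations
```

Because each condition appends its own label, a record with open defects and an odometer note reports both instead of collapsing into a single generic flag.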

Compliance policy and action agents

The compliance policy agent queries a small FMCSA-style policy chunk index. Each chunk includes a code like FMCSA-DOT-001, title, body, and vehicle class tag. Search is lexical overlap with class and category boosts. The action agent converts urgency into recommended interventions and response hour counts. Critical brake wear triggers immediate pull-from-dispatch steps with a two-hour response window.

API and clients

FastAPI exposes /api/validate/sample for the bundled dataset and /api/validate for custom payloads. The root route serves a dark-themed dashboard that renders urgency cards and a results table after a POST call. I put the frontend in a single HTML file to avoid Node build tooling in a PoC README.

Part 3 — Implementation Deep Dive

Let's Get Cooking

Here is where I translate the design into code. I split the implementation into schema models, specialist agents, the LangGraph workflow, and the API layer. I wrote each block to be readable in isolation because, in my experience, agent repos rot quickly when everything lives in one thousand-line module.

Data models

I started with Pydantic models because maintenance records arrive from JSON exports and telematics gateways in real life, and I wanted validation before any agent touched the data. Strong typing also made FastAPI response models trivial to declare.

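from enum import Enum

from pydantic import BaseModel, Field

# VehicleClass is an Enum with members CLASS_7, CLASS_8, LIGHT_DUTY, and REFRIGERATED (definition omitted here).
class UrgencyLevel(str, Enum):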
    EMERGENCY = "emergency"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    NORMAL = "normal"

class ComplianceViolation(str, Enum):
    DOT_INSPECTION_OVERDUE = "dot_inspection_overdue"
    DOT_INSPECTION_DUE_SOON = "dot_inspection_due_soon"
    BRAKE_WEAR_CRITICAL = "brake_wear_critical"
    BRAKE_WEAR_WARNING = "brake_wear_warning"
    TIRE_TREAD_CRITICAL = "tire_tread_critical"
    TIRE_TREAD_WARNING = "tire_tread_warning"
    ODOMETER_GAP = "odometer_gap"
    PM_SERVICE_OVERDUE = "pm_service_overdue"
    PM_SERVICE_DUE_SOON = "pm_service_due_soon"
    ELD_HOURS_EXCEEDED = "eld_hours_exceeded"
    DEFECT_REPORT_OPEN = "defect_report_open"
    DOCUMENTATION_MISSING = "documentation_missing"

class MaintenanceRecord(BaseModel):
    record_id: str
    vehicle_id: str
    vehicle_class: VehicleClass
    odometer_miles: int = Field(ge=0, le=2_000_000)
    last_dot_inspection_days_ago: int = Field(ge=0, le=730)
    last_pm_service_miles_ago: int = Field(ge=0, le=500_000)
    brake_pad_percent_remaining: float = Field(ge=0.0, le=100.0)
    tire_tread_depth_32nds: float = Field(ge=0.0, le=32.0)
    eld_driving_hours_7day: float = Field(ge=0.0, le=80.0)
    open_defect_count: int = Field(ge=0, le=50)
    has_maintenance_log: bool = True
    recorded_at: str
    dispatcher_notes: str = ""

These models became the contract between CLI, API, and agents. Keeping enums strict prevented silent string drift in urgency labels when I generated charts later. I designed dispatcher_notes as a first-class field because steer-axle tire logic and odometer gap detection depend on human context that pure sensor readings cannot capture.

DOT inspection agent

The DOT inspection agent is the first specialist in the pipeline. It answers the simplest regulatory question first: has this vehicle exceeded the annual inspection window?

DOT_INSPECTION_MAX_DAYS = 365
DOT_INSPECTION_WARN_DAYS = 330

def dot_inspection_agent(record: MaintenanceRecord, trace: list[str]) -> list[ComplianceViolation]:
    violations: list[ComplianceViolation] = []

    if record.last_dot_inspection_days_ago > DOT_INSPECTION_MAX_DAYS:
        violations.append(ComplianceViolation.DOT_INSPECTION_OVERDUE)
    elif record.last_dot_inspection_days_ago > DOT_INSPECTION_WARN_DAYS:
        violations.append(ComplianceViolation.DOT_INSPECTION_DUE_SOON)

    trace.append(
        f"DotInspectionAgent: days_since_inspection={record.last_dot_inspection_days_ago}, "
        f"violations={len(violations)}"
    )
    return violations

I put this agent first because inspection currency is a binary compliance gate. A vehicle whose last inspection was three hundred seventy-two days ago, like refrigerated unit REF-3301 in my sample batch, should never slip through because brake wear looked acceptable. The trace line records both the raw day count and violation count so I can audit decisions in the CLI without re-running the graph.

Brake and tire agent

Mechanical wear is where fleet business logic actually lives. Class 8 tractors hit the critical threshold at twenty percent pad life remaining, while light-duty vans do not hit it until ten percent, so the heavier units are pulled earlier. Steer-axle notes tighten tire minimums because federal steer tire rules are stricter than drive tire rules.

BRAKE_CRITICAL: dict[VehicleClass, float] = {
    VehicleClass.CLASS_7: 15.0,
    VehicleClass.CLASS_8: 20.0,
    VehicleClass.LIGHT_DUTY: 10.0,
    VehicleClass.REFRIGERATED: 18.0,
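}

# BRAKE_WARNING (same shape as BRAKE_CRITICAL) and the tire constants
# TIRE_CRITICAL_32NDS, STEER_TIRE_CRITICAL_32NDS, and TIRE_WARNING_32NDS
# are defined elsewhere in the module and omitted from this excerpt.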

def brake_tire_agent(record: MaintenanceRecord, trace: list[str]) -> list[ComplianceViolation]:
    violations: list[ComplianceViolation] = []
    brake_crit = BRAKE_CRITICAL[record.vehicle_class]
    brake_warn = BRAKE_WARNING[record.vehicle_class]

    if record.brake_pad_percent_remaining < brake_crit:
        violations.append(ComplianceViolation.BRAKE_WEAR_CRITICAL)
    elif record.brake_pad_percent_remaining < brake_warn:
        violations.append(ComplianceViolation.BRAKE_WEAR_WARNING)

    steer_note = "steer" in record.dispatcher_notes.lower()
    tire_crit = STEER_TIRE_CRITICAL_32NDS if steer_note else TIRE_CRITICAL_32NDS

    if record.tire_tread_depth_32nds < tire_crit:
        violations.append(ComplianceViolation.TIRE_TREAD_CRITICAL)
    elif record.tire_tread_depth_32nds < TIRE_WARNING_32NDS:
        violations.append(ComplianceViolation.TIRE_TREAD_WARNING)

    trace.append(
        f"BrakeTireAgent: brake={record.brake_pad_percent_remaining}%, "
        f"tread={record.tire_tread_depth_32nds}/32, steer_axle={steer_note}"
    )
    return violations

This block is where I learned the most during testing. TRK-1042 in the sample data reports twelve percent brake pads on a Class 8 unit with steer-axle notes and three and a half thirty-seconds tread. Both brake critical and tire tread critical fire, because the steer note raises the tire minimum to four thirty-seconds. Without the steer note, tread at three and a half would only trigger a warning, not critical. Dispatcher context changed the outcome, which is exactly the behavior I wanted.
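
To make the steer-note effect concrete, here is a small self-contained version of the tire branch only. The two critical thresholds come from the article; the warning threshold of six thirty-seconds is an assumed placeholder, since the text does not state it, and tire_label is a hypothetical helper name.

```python
from typing import Optional

TIRE_CRITICAL_32NDS = 2.0
STEER_TIRE_CRITICAL_32NDS = 4.0
TIRE_WARNING_32NDS = 6.0  # assumed value; not stated in the article


def tire_label(tread_32nds: float, dispatcher_notes: str) -> Optional[str]:
    """Classify tread depth, tightening the minimum when notes mention steer."""
    steer_note = "steer" in dispatcher_notes.lower()
    tire_crit = STEER_TIRE_CRITICAL_32NDS if steer_note else TIRE_CRITICAL_32NDS
    if tread_32nds < tire_crit:
        return "tire_tread_critical"
    if tread_32nds < TIRE_WARNING_32NDS:
        return "tire_tread_warning"
    return None


# Same tread reading, different outcome depending on dispatcher context.
with_note = tire_label(3.5, "Steer axle wear noted")
without_note = tire_label(3.5, "")
```

Here with_note evaluates to the critical label and without_note to the warning label, so the only thing separating the two outcomes is the dispatcher's note.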

The compliance policy agent wraps policy search across detected violations:

def compliance_policy_agent(record, violations, trace) -> list[str]:
    # Excerpt: deriving `query` and `category` from the detected violations is elided here.
    hits = search_policies(query, vehicle_class=record.vehicle_class.value, category=category, top_k=1)
    refs = [f"{hit.code}: {hit.title}" for hit in hits]
    trace.append(f"CompliancePolicyAgent: attached {len(refs)} policy references")
    return refs

The action agent returns both steps and response hours. That pairing is what makes the output feel operational instead of academic.
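
A rough sketch of that pairing follows. The step wording echoes the remediation steps named later in the walkthrough, and the two-hour emergency window comes from the article. Everything else is an assumption: the action_plan name, the string labels, and especially PLACEHOLDER_RESPONSE_HOURS, which stands in for values the article does not give.

```python
# One remediation step per violation type, so multi-violation records stay readable.
REMEDIATION_STEPS = {
    "brake_wear_critical": "Priority brake service",
    "tire_tread_critical": "Tire replacement",
    "eld_hours_exceeded": "Mandatory 34-hour restart",
    "defect_report_open": "Resolve open defects",
    "odometer_gap": "Odometer audit",
}
RESPONSE_HOURS = {"emergency": 2}  # the only window stated in the article
PLACEHOLDER_RESPONSE_HOURS = 24  # hypothetical default for every other level


def action_plan(urgency: str, violations: list[str]) -> tuple[list[str], int]:
    """Return (recommended steps, response hours) for one record."""
    if not violations:
        # Normal results still carry an explicit clearance action.
        return ["Cleared for dispatch"], PLACEHOLDER_RESPONSE_HOURS
    steps: list[str] = []
    if urgency == "emergency":
        steps.append("Pull from dispatch immediately")
    steps += [REMEDIATION_STEPS[v] for v in violations if v in REMEDIATION_STEPS]
    return steps, RESPONSE_HOURS.get(urgency, PLACEHOLDER_RESPONSE_HOURS)
```

Mapping each violation to its own step is what lets a record with five simultaneous violations produce six distinct actions rather than one generic alert.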

LangGraph workflow

The graph is small but worth showing because it is the spine of the PoC:

from typing import Any

from langgraph.graph import END, StateGraph

def build_compliance_graph() -> Any:
    graph = StateGraph(ComplianceState)
    graph.add_node("load", _load_batch)
    graph.add_node("process", _process_next)
    graph.add_node("summarize", _summarize)
    graph.set_entry_point("load")
    graph.add_edge("load", "process")
    graph.add_conditional_edges(
        "process",
        _should_continue,
        {"process": "process", "summarize": "summarize"},
    )
    graph.add_edge("summarize", END)
    return graph.compile()

Each pass through process validates one maintenance record and increments the index. When the index reaches the batch length, the graph routes to summarize, which is a hook I left open for aggregate analytics such as average confidence or violation histograms.
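
The index-based loop can be emulated without LangGraph, which helps when reasoning about what each node does to the state. This is a sketch under assumptions: the field names follow the state description earlier in the article, validate_record is a hypothetical placeholder for the agent pipeline, and the empty-batch guard is an addition for the example.

```python
from typing import Any, TypedDict


class ComplianceState(TypedDict):
    """State fields named in the article: fleet name, records, results, index."""
    fleet_name: str
    records: list[dict[str, Any]]
    results: list[dict[str, Any]]
    index: int


def validate_record(record: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical placeholder for the per-record agent pipeline."""
    return {"record_id": record["record_id"], "violations": []}


def process_next(state: ComplianceState) -> ComplianceState:
    """Validate the record at the current index, then advance the index."""
    if state["index"] >= len(state["records"]):
        return state  # assumed guard so an empty batch does not raise IndexError
    record = state["records"][state["index"]]
    return {
        "fleet_name": state["fleet_name"],
        "records": state["records"],
        "results": state["results"] + [validate_record(record)],
        "index": state["index"] + 1,
    }


def should_continue(state: ComplianceState) -> str:
    """Route back to "process" until every record has been handled."""
    return "process" if state["index"] < len(state["records"]) else "summarize"


def run_loop(state: ComplianceState) -> ComplianceState:
    """Emulate load -> process -> (process | summarize) without LangGraph."""
    state = process_next(state)  # the load -> process edge is unconditional
    while should_continue(state) == "process":
        state = process_next(state)
    return state


final = run_loop({
    "fleet_name": "demo",
    "records": [{"record_id": "REC-001"}, {"record_id": "REC-002"}],
    "results": [],
    "index": 0,
})
```

One design consequence of a one-record-per-step loop: LangGraph enforces a recursion_limit per invocation, so larger batches may need that limit raised in the invocation config.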

run_batch_compliance computes urgency counts and violation distributions after the graph finishes. That summary feeds both the Rich CLI table and the Matplotlib chart in src/fleetguard/analytics.py.

return BatchComplianceReport(
    fleet_name=fleet_name,
    processed=len(results),
    emergency_count=counts[UrgencyLevel.EMERGENCY],
    high_count=counts[UrgencyLevel.HIGH],
    medium_count=counts[UrgencyLevel.MEDIUM],
    low_count=counts[UrgencyLevel.LOW],
    normal_count=counts[UrgencyLevel.NORMAL],
    results=results,
    summary_stats={
        # max(..., 1) avoids a ZeroDivisionError when the batch is empty
        "avg_confidence": round(sum(r.confidence for r in results) / max(len(results), 1), 2),
        "avg_response_hours": round(sum(r.response_hours for r in results) / max(len(results), 1), 1),
        "violation_distribution": violation_distribution,
    },
)

The summary statistics feed the dashboard cards and the Matplotlib chart in the README. When I first omitted violation_distribution, the UI felt empty even though per-row validation was correct. Aggregates matter for human scannability.

FastAPI surface

The API is intentionally thin. It validates payloads, calls run_batch_compliance, and returns a BatchComplianceReport. The sample endpoint loads data/sample_fleet_records.json, which describes eight records across Class 7, Class 8, light-duty, and refrigerated units: TRK-1042 with critical brake wear and a steer-axle tire concern; TRK-2087, cleared for a regional route; REF-3301 with an overdue DOT inspection; VAN-0091 with missing maintenance paperwork; TRK-1156 with critical brakes, critical tires, ELD hours exceeded, open defects, and an odometer discrepancy; TRK-1893 with its DOT window closing and PM overdue; TRK-2204, nominal for long-haul; and REF-3412, a routine refrigerated check.

@app.post("/api/validate/sample")
def validate_sample() -> JSONResponse:
    records = _load_sample_records()
    report = run_batch_compliance(records)
    return JSONResponse(content=json.loads(report.model_dump_json()))

@app.post("/api/validate")
def validate_batch(records: list[MaintenanceRecord]) -> JSONResponse:
    report = run_batch_compliance(records)
    return JSONResponse(content=json.loads(report.model_dump_json()))

I wired CORS permissively because this is a local PoC dashboard, not a public deployment. If I hardened the experiment, authentication and fleet tenancy would come first.

CLI presentation layer

main.py uses Rich tables because dispatchers often work from tabular summaries. I print fleet name, urgency counts, and per-record rows with color-coded urgency. The CLI became the source of truth for the GIF animation frames, which keeps marketing assets honest.

What surprised me during implementation

When I first ran the batch, refrigerated unit REF-3301, at three hundred seventy-two days since inspection, came back as emergency alongside a brake wear warning. The emergency label did not come from the brakes, which were only at warning level, but from the overdue DOT inspection, which shares the emergency rung of the ladder with critical mechanical wear. I tightened the urgency agent so an overdue inspection always surfaces at the top regardless of secondary warnings. Another surprise: TRK-1156 accumulated five simultaneous violations yet still produced a coherent action list because each violation maps to a distinct remediation step rather than collapsing into a generic alert.

Urgency mapping logic

The urgency agent applies a priority ladder I sketched on paper before coding. Critical brake wear, critical tire tread, and overdue DOT inspection always win emergency status. ELD hours exceeded and open defect reports escalate to high. Warning-level brake and tire issues with PM due soon land at medium. Documentation missing alone stays low. I wrote the ladder as explicit conditionals rather than a scored formula because explainability beat marginal accuracy in this PoC.

def urgency_agent(violations: list[ComplianceViolation], trace: list[str]) -> UrgencyLevel:
    if any(v in violations for v in (
        ComplianceViolation.BRAKE_WEAR_CRITICAL,
        ComplianceViolation.TIRE_TREAD_CRITICAL,
        ComplianceViolation.DOT_INSPECTION_OVERDUE,
    )):
        return UrgencyLevel.EMERGENCY
    if ComplianceViolation.ELD_HOURS_EXCEEDED in violations:
        return UrgencyLevel.HIGH
    # ... additional branches for defects, PM, warnings

Readers forking the repo should treat this function as the first customization point. Every fleet operates with different risk appetite, and swapping urgency tables is easier than rewriting an LLM prompt.
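
For readers who want the whole ladder in one place, the following sketch fills in the branches the excerpt elides, using plain strings instead of the enums. The emergency, high, and low rungs follow the description above. Treating every other combination as medium is an assumption, since the article only names warning-level brake and tire issues with PM due soon.

```python
EMERGENCY = {"brake_wear_critical", "tire_tread_critical", "dot_inspection_overdue"}
HIGH = {"eld_hours_exceeded", "defect_report_open"}
LOW_ONLY = {"documentation_missing"}


def urgency_ladder(violations: list[str]) -> str:
    """Walk the priority ladder from most to least severe."""
    found = set(violations)
    if not found:
        return "normal"
    if found & EMERGENCY:
        return "emergency"
    if found & HIGH:
        return "high"
    if found <= LOW_ONLY:
        return "low"  # documentation missing on its own
    # Assumption: every remaining warning-level combination lands at medium.
    return "medium"
```

Because the rungs are plain sets, changing a fleet's risk appetite means moving a label between sets rather than editing control flow.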

Confidence scoring rationale

Confidence is not model probability here; it is a heuristic transparency score. Base confidence starts at 0.74. Multiple violations add 0.07. Emergency or high urgency adds 0.06. A complete maintenance log adds 0.04. The cap is 0.97 because I never want a deterministic rules engine pretending certainty it cannot justify. When I present confidence in the CLI, I intend it as a sort key for human review, not as a statistical guarantee.
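
The heuristic is simple enough to write out. This sketch assumes each bonus is a flat, one-time increment, so "multiple violations" means more than one rather than a per-violation bonus. The function name and arguments are illustrative.

```python
BASE_CONFIDENCE = 0.74
CONFIDENCE_CAP = 0.97


def confidence_score(violation_count: int, urgency: str, has_maintenance_log: bool) -> float:
    """Additive transparency score; not a model probability."""
    score = BASE_CONFIDENCE
    if violation_count > 1:
        score += 0.07  # multiple violations corroborate each other
    if urgency in ("emergency", "high"):
        score += 0.06
    if has_maintenance_log:
        score += 0.04
    return round(min(score, CONFIDENCE_CAP), 2)
```

Under this reading the highest reachable score is 0.91, so the 0.97 cap acts as a safety net that only matters if further bonuses are added later.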

Part 4 — Operations, Reflections, and Next Steps

Let's Setup

Step-by-step details can be found at the repository README: https://github.com/aniket-work/FleetGuard-AI

Locally, the setup path I used looks like this:

Clone the repository and create a virtual environment inside the project root.
Install dependencies from requirements.txt, which pins LangGraph, FastAPI, Rich, and Matplotlib among others.
Run python main.py to validate the sample JSON batch and print the Rich summary table.
Optional: run python main.py --chart to generate images/validation-stats-chart.png.
Optional: start uvicorn fleetguard.api.server:app --app-dir src --reload and open the dashboard at port 8000.
Optional: execute pytest tests/ -v to confirm critical brake detection and batch confidence thresholds.

I recommend keeping the virtual environment inside the repository folder for PoCs like this so paths to data/sample_fleet_records.json remain predictable when you run scripts from different working directories.

Environment variables are not required for the default demo. If you extend the project with embedding-based retrieval or an external LLM provider, add keys through a .env file but do not commit secrets.

Let's Run

Running python main.py against the bundled dataset reported eight processed records in my latest run: three emergency, one medium, one low, and three normal urgency findings. Emergency items included TRK-1042 with critical brake wear and a steer-axle tire concern; REF-3301 with a DOT inspection overdue beyond three hundred sixty-five days; and TRK-1156 with critical brakes at eight percent pad life, critical tire tread at one point eight thirty-seconds, ELD hours at sixty-two point five, two open defects, and an odometer discrepancy note. Average confidence landed around eighty-three percent across the mixed batch.

The CLI output ends with an ASCII-friendly table that mirrors what I animated in the repository GIF. That was intentional: I wanted the marketing asset and the actual tool output to tell the same story.

For API mode, POST /api/validate/sample returns JSON containing per-record recommended_actions and policy_refs. The dashboard renders urgency cards and a results table without a separate frontend build step.

To generate analytics assets, run python main.py --chart. It writes images/validation-stats-chart.png, which I included in the README so GitHub visitors see quantitative behavior immediately.

When I demo this PoC to myself after a break, I use a three-step smoke path: CLI batch, pytest, then dashboard POST. That sequence exercises the graph, validation rules, and HTTP layer without needing external services. If all three pass, I trust the repository state enough to publish diagrams or write about it.

For readers who prefer API exploration with curl, curl -X POST http://localhost:8000/api/validate/sample returns the full BatchComplianceReport JSON. I often pipe that output through jq to inspect a single record's recommended_actions array and verify policy references attached correctly.

Vehicle-class threshold tables I encoded

The business rules in this PoC are intentionally visible as Python dictionaries rather than buried in prompts. Class 8 brake pad critical threshold is twenty percent with warning at thirty percent. Class 7 critical is fifteen percent. Light-duty critical is ten percent because smaller vehicles carry different wear profiles. Refrigerated units use an eighteen percent critical threshold and a fifteen thousand mile PM interval instead of twenty five thousand. Tire tread critical defaults to two thirty-seconds with steer-axle notes raising the effective minimum to four thirty-seconds. I chose these numbers from publicly available FMCSA inspection guidance and simplified them for demo clarity. They are not fleet-specific standards.

ELD hour limits matter because hours-of-service violations carry fines independent of mechanical condition. When TRK-1156 reports sixty-two point five driving hours in seven days alongside mechanical emergencies, multiple agents fire without one silently overriding another. That combination is exactly the kind of multi-signal case I wanted LangGraph to handle.

Human-in-the-loop considerations

Even in a deterministic PoC, I reserved mental space for human review. Emergency readings should page a fleet manager, not close the loop autonomously. LangGraph supports interrupt_before on nodes when the graph is compiled with a checkpointer, and I would place that interrupt before any future "execute out-of-service order" node if I connected to real dispatch systems. Today the action agent only returns text recommendations. That boundary matters ethically and practically.

Edge Cases and Testing Philosophy

I wrote three pytest cases covering critical brake emergency detection, clean record normal clearance, and batch length with confidence thresholds. Tests encode what I care about in this PoC: critical brake wear must never be downgraded quietly, and every normal result must include a dispatch clearance action.

Ambiguous dispatcher notes remain a weakness. The steer check is a plain substring match, so a note that mentions "steer" in an unrelated context, such as a steering complaint, still tightens tire thresholds. In production I would route low-confidence classifications to a human queue. Here, I expose agent_trace lists to make that queue possible later.
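
One possible mitigation, shown here as a swapped-in technique rather than anything in the repository, is to match whole phrases with a regular expression instead of a bare substring. The pattern and the mentions_steer_axle helper are hypothetical, and the accepted phrases would need tuning against real dispatcher notes.

```python
import re

# Matches "steer axle", "steer-axle", "steer tire(s)", or "steer tyre(s)" as whole words.
STEER_PATTERN = re.compile(r"\bsteer[- ](?:axle|tire|tyre)s?\b", re.IGNORECASE)


def mentions_steer_axle(dispatcher_notes: str) -> bool:
    """Return True only for notes that name the steer axle or steer tires."""
    return STEER_PATTERN.search(dispatcher_notes) is not None


# The bare substring test fires on an unrelated steering complaint...
naive_hit = "steer" in "steering wheel cover replaced"
# ...while the phrase match does not.
strict_hit = mentions_steer_axle("steering wheel cover replaced")
```

The trade-off is recall: a stricter pattern misses free-form phrasing, which is another reason to keep a human review queue for uncertain notes.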

Vehicle class tables only cover four classes in this demo. A real fleet with mixed power units and trailers would need composite rules. That limitation is acceptable in a demo but worth documenting honestly.

Walking Through a Sample Record

To make the pipeline concrete, consider record REC-005 from the sample dataset: Class 8 tractor TRK-1156 reports brake pads at eight percent, tire tread at one point eight thirty-seconds, ELD hours at sixty-two point five, two open defects, and dispatcher notes mentioning odometer discrepancy and driver hours over limit. When I feed this batch through the graph, the brake tire agent flags critical brake wear below the Class 8 twenty percent floor and critical tire tread below the two thirty-seconds minimum. The ELD hours agent flags hours exceeded above sixty. The defect documentation agent flags open defects and odometer gap from notes. The compliance policy agent retrieves FMCSA references for each violation category. The action agent responds with immediate pull-from-dispatch, priority brake service, tire replacement, mandatory thirty-four-hour restart, defect resolution, and odometer audit steps with a two-hour response window.

That end-to-end path takes milliseconds in my local runs, but the value is not speed. The value is repeatability. Every record in the batch gets the same structured treatment, which means I can diff reports across code versions when I tune heuristics.

Policy Knowledge Base Design

I modeled the knowledge layer as a list of PolicyChunk dataclass rows rather than jumping straight to ChromaDB. Each chunk stores a code, title, body, vehicle class tag, and category. The search function tokenizes query and body text, counts overlap, and applies class and category boosts. This is intentionally primitive. From my experience building RAG demos, people sometimes spend days on embedding pipelines before validating whether retrieval inputs are structured correctly. Here, the interface is stable: search_policies(query, vehicle_class, category, top_k).

If I swap in embeddings later, agents above the knowledge layer stay unchanged. That separation mattered to me because it mirrors how I would approach a production migration: nail schemas and agent IO first, upgrade retrieval second.
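
A minimal sketch of such a lexical index, assuming flat additive boosts. The dataclass fields follow the description above. The boost values, the extra chunks parameter (added so the example is self-contained), the DEMO-BRK-001 code, and the chunk text are all invented for illustration and are not real policy language.

```python
import re
from dataclasses import dataclass

CLASS_BOOST = 1.0     # hypothetical weight for a matching vehicle class
CATEGORY_BOOST = 1.5  # hypothetical weight for a matching category


@dataclass(frozen=True)
class PolicyChunk:
    code: str
    title: str
    body: str
    vehicle_class: str
    category: str


def _tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens, deduplicated."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_policies(query, vehicle_class, category, top_k=1, chunks=()):
    """Rank chunks by token overlap with the query plus class and category boosts."""
    query_tokens = _tokens(query)
    scored = []
    for chunk in chunks:
        score = float(len(query_tokens & _tokens(chunk.body)))
        if chunk.vehicle_class == vehicle_class:
            score += CLASS_BOOST
        if chunk.category == category:
            score += CATEGORY_BOOST
        if score > 0:
            scored.append((score, chunk))
    # Sort on the score only; PolicyChunk instances are not orderable.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


SAMPLE_CHUNKS = [
    PolicyChunk("FMCSA-DOT-001", "Annual inspection",
                "Annual inspection must be current within twelve months", "class_8", "dot"),
    PolicyChunk("DEMO-BRK-001", "Brake wear",
                "Brake pad wear below threshold requires service", "class_8", "brake"),
]
hits = search_policies("brake pad wear critical", "class_8", "brake", top_k=1, chunks=SAMPLE_CHUNKS)
```

Since callers only see the function signature and the returned chunks, replacing the overlap count with embedding similarity would not change any agent code.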

Monolithic Prompt Versus Specialist Agents

Early in the experiment I tried a single prompt that asked for violations, urgency, citations, and actions in one JSON blob. It failed in boring ways: urgency skewed conservative when citations were verbose, and ELD hour violations were mislabeled as PM overdue when mileage notes appeared in the same sentence. Splitting responsibilities into agents eliminated most of that cross-talk. Each agent receives only the fields it needs, writes to agent_trace, and returns a typed value the coordinator merges.

This aligns with how I think about LangGraph more generally. Nodes are functions with narrow contracts. Edges express control flow a dispatcher could whiteboard. When I revisit the monolithic prompt idea, it will be as a final summarization layer, not as the core reasoning engine.
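
The narrow-contract idea can be sketched in plain Python, independent of any graph library. Everything here is an assumption for illustration: the AgentResult shape, the function names, the flag strings, and the 25,000-mile PM interval are hypothetical, and in a LangGraph build functions like these would be registered as nodes rather than called directly.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AgentResult:
    """Typed value each agent hands back to the coordinator (hypothetical shape)."""
    agent: str
    violations: list[str] = field(default_factory=list)
    agent_trace: list[str] = field(default_factory=list)


def eld_hours_agent(eld_hours: float, limit: float = 60.0) -> AgentResult:
    """Sees only ELD hours, so mileage notes cannot leak into its decision."""
    result = AgentResult(agent="eld_hours")
    result.agent_trace.append(f"eld_hours: checked {eld_hours} against limit {limit}")
    if eld_hours > limit:
        result.violations.append("eld_hours_exceeded")
    return result


def pm_agent(miles_since_pm: int, interval: int = 25_000) -> AgentResult:
    """Sees only mileage; the interval value is an assumed placeholder."""
    result = AgentResult(agent="pm")
    result.agent_trace.append(f"pm: checked {miles_since_pm} against interval {interval}")
    if miles_since_pm > interval:
        result.violations.append("pm_overdue")
    return result


def merge(results: list[AgentResult]) -> dict:
    """Coordinator step: concatenate violations and traces in agent order."""
    merged = {"violations": [], "agent_trace": []}
    for result in results:
        merged["violations"].extend(result.violations)
        merged["agent_trace"].extend(result.agent_trace)
    return merged


record = {"eld_hours": 62.5, "miles_since_pm": 12_000}
report = merge([
    eld_hours_agent(record["eld_hours"]),
    pm_agent(record["miles_since_pm"]),
])
print(report["violations"])
```

Since each function receives a single field, the ELD-versus-PM mislabeling described above has no path to occur: the mileage value never reaches the hours check.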

Dashboard and Visualization Choices

The HTML dashboard uses a dark theme with urgency-colored badges because fleet monitoring tools I have seen in demos often default to sterile white tables that hide urgency. I wanted emergency items to visually pop. The Matplotlib chart uses horizontal bars for urgency levels and top violation categories. Static charts export cleanly to the README, which helps GitHub visitors who never run the code still grasp batch behavior.

Performance and Operational Notes

The sample batch of eight records completes in under a second on my laptop, including graph orchestration overhead. I did not optimize for throughput because validation batches in this PoC are tiny. If batches grew to thousands of historical records, I would parallelize per-record validation outside LangGraph or shard by fleet while keeping the same agent functions.
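
As a rough sketch of that idea, the snippet below fans per-record validation out over a thread pool and shows a fleet-sharding helper. The validate_record stand-in, the record fields, and the "routine" label are hypothetical. For CPU-bound pure-Python rules a process pool would likely be the better fit; a thread pool is used here only to keep the example short.

```python
from __future__ import annotations

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def validate_record(record: dict) -> dict:
    """Stand-in for the real per-record agent chain (hypothetical logic)."""
    urgent = record["brake_pad_pct"] < 20.0
    return {
        "record_id": record["record_id"],
        "urgency": "emergency" if urgent else "routine",
    }


def shard_by_fleet(records: list[dict]) -> dict[str, list[dict]]:
    """Group records by fleet so each shard can be processed independently."""
    shards = defaultdict(list)
    for record in records:
        shards[record["fleet_id"]].append(record)
    return dict(shards)


def validate_batch(records: list[dict], max_workers: int = 4) -> list[dict]:
    """Validate records concurrently; Executor.map keeps the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(validate_record, records))


batch = [
    {"record_id": "REC-001", "fleet_id": "east", "brake_pad_pct": 55.0},
    {"record_id": "REC-005", "fleet_id": "west", "brake_pad_pct": 8.0},
    {"record_id": "REC-006", "fleet_id": "east", "brake_pad_pct": 31.0},
]
results = validate_batch(batch)
print([row["urgency"] for row in results])
print(sorted(shard_by_fleet(batch)))
```

The agent functions themselves stay untouched in this arrangement; only the loop that calls them changes.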

Memory footprint stays small because records are pydantic objects in a list, not streamed from a telematics broker. Logging is print-based in the CLI, with trace lists attached to results. A next step would be structured JSON logs with record IDs for observability, but I skipped that to keep the repository approachable.
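
The structured JSON logging mentioned above could look roughly like this with only the standard library. The logger name, the payload keys, and the agent label are assumptions. Output goes to an in-memory buffer here so it can be inspected, where a real setup would more likely write to stdout or a file.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Fields passed via `extra=` become attributes on the LogRecord.
            "record_id": getattr(record, "record_id", None),
            "agent": getattr(record, "agent", None),
        }
        return json.dumps(payload)


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("fleetguard.sketch")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the sketch's output out of the root logger

logger.info("critical brake wear", extra={"record_id": "REC-005", "agent": "brake_tire"})
print(stream.getvalue().strip())
```

With one JSON object per line, filtering a batch run down to a single record ID becomes a matter of parsing each line and matching on record_id.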

Theory Behind Response Hours

Response hours encode how quickly I believe a fleet manager must show correction before roadside risk or citation exposure grows. Critical brake wear and overdue DOT inspection map to two hours in my heuristics because out-of-service conditions cannot wait a business week. Documentation gaps map to seventy-two hours because they are serious but not always an immediate roadside event. These numbers are not organizational standards; they are PoC placeholders where I tried to mirror operational urgency I observed in fleet compliance materials.
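
One way to encode that mapping is a lookup table where the most urgent violation wins. The two-hour and seventy-two-hour values come from the paragraph above; the key names, the default for unlisted violations, and the take-the-minimum rule are assumptions for illustration.

```python
from __future__ import annotations

# Hours per violation category; values from the text, key names hypothetical.
RESPONSE_HOURS = {
    "critical_brake_wear": 2,
    "dot_inspection_overdue": 2,
    "documentation_gap": 72,
}

# Assumed fallback for violation types not listed above.
DEFAULT_RESPONSE_HOURS = 72


def response_hours(violations: list[str]) -> int | None:
    """Return the tightest response window, or None when nothing is flagged."""
    if not violations:
        return None
    return min(RESPONSE_HOURS.get(v, DEFAULT_RESPONSE_HOURS) for v in violations)


print(response_hours(["documentation_gap", "critical_brake_wear"]))
print(response_hours(["documentation_gap"]))
```

Taking the minimum means a record with both a documentation gap and critical brake wear inherits the two-hour window, which matches how REC-005 ends up with a two-hour response window despite carrying slower-moving issues as well.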

Integration Points I Deliberately Left Open

The FastAPI layer returns pydantic-serialized JSON ready for webhooks, SMS workers, or transportation management dashboards I did not build. agent_trace arrays are the extension point for human review queues: a UI could highlight low-confidence rows or records where mechanical and regulatory agents disagreed with a prior human label. The compliance policy agent returns string citations today; tomorrow it could return structured objects with URLs to regulatory documents.

I also left authentication out entirely. If this ever faced real fleet credentials or tenant data, isolation and role-based access would precede any model upgrade.

Ethics and Boundaries

Automating fleet compliance validation touches driver safety, carrier CSA scores, and economic outcomes for fleet operators. I want to be explicit: this PoC does not place out-of-service orders, does not lock ELD devices, and does not connect to live dispatch control planes. It sorts records you give it. If someone deployed a derivative system, human review on emergency outcomes would be non-negotiable in my view, and rule changes would require traceable versioning.

I also avoided implying employer affiliation. This build lives entirely in my personal GitHub namespace as an experiment.

Future Roadmap I Would Explore

If I continue this PoC, my next steps would be:

Replace lexical policy search with embeddings stored in Chroma or pgvector, then measure citation precision against a held-out record set.
Add an LLM summarization step that rewrites recommended actions into dispatcher runbook language while keeping structured JSON underneath.
Introduce vehicle maintenance history so repeat brake wear patterns escalate urgency automatically.
Export compliance ticket drafts with policy footnotes for transportation management systems.
Build a feedback endpoint where a human reviewer marks agent mistakes, creating training data for later fine-tuning.

None of those are implemented here. They are the natural extension points I noted while writing the graph skeleton.

What I Would Not Change

If I rebuilt the PoC from scratch, I would keep three decisions exactly as they are: strict pydantic schemas at the boundary, separate specialist modules instead of one prompt file, and agent_trace on every result. Those choices cost almost nothing in lines of code but paid off every time I debugged a surprising urgency label.

Reader Exercises

If you fork the repository, here are three exercises I found useful while learning:

Add a new record to sample_fleet_records.json with ambiguous dispatcher notes and write a test that asserts your expected urgency.
Introduce an additional agent that drafts a plain-language summary paragraph per record without altering structured fields.
Swap lexical policy search for an embedding model and compare citation precision on ten hand-labeled examples.

Each exercise touches a different production concern: evaluation, human-readable output, and retrieval quality.

Comparing Batch Outputs Across Iterations

One habit I kept from prior agent experiments is saving JSON reports when iterating heuristics. FleetGuard-AI supports --json-out on the CLI. Diffing two JSON files after a brake threshold tweak shows exactly which record IDs changed urgency or response hours. That practice sounds tedious, but it prevented regressions when I adjusted Class 8 brake critical percentages. Without structured output, I would rely on eyeballing terminal tables, which does not scale past a handful of records.
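
A small helper makes that diff explicit. This sketch assumes each report is a JSON list of objects with record_id, urgency, and response_hours keys; the actual layout written by --json-out may differ, and the urgency labels below are placeholders.

```python
from __future__ import annotations

import json


def index_by_record(report: list[dict]) -> dict[str, dict]:
    """Key report rows by record ID for direct comparison."""
    return {row["record_id"]: row for row in report}


def diff_reports(old: list[dict], new: list[dict],
                 fields: tuple[str, ...] = ("urgency", "response_hours")) -> dict:
    """Return {record_id: {field: (old_value, new_value)}} for changed fields."""
    old_idx, new_idx = index_by_record(old), index_by_record(new)
    changes = {}
    for record_id in sorted(old_idx.keys() & new_idx.keys()):
        delta = {
            name: (old_idx[record_id].get(name), new_idx[record_id].get(name))
            for name in fields
            if old_idx[record_id].get(name) != new_idx[record_id].get(name)
        }
        if delta:
            changes[record_id] = delta
    return changes


# Inline JSON stands in for two saved report files.
before = json.loads(
    '[{"record_id": "REC-003", "urgency": "high", "response_hours": 24},'
    ' {"record_id": "REC-005", "urgency": "emergency", "response_hours": 2}]'
)
after = json.loads(
    '[{"record_id": "REC-003", "urgency": "emergency", "response_hours": 2},'
    ' {"record_id": "REC-005", "urgency": "emergency", "response_hours": 2}]'
)
print(diff_reports(before, after))
```

Only records present in both reports are compared, so added or removed records would need a separate check on the two key sets.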

Why Fleet Compliance and Not Another Domain

I rotate domains in my personal agent experiments to avoid repeating myself. Fleet maintenance sat outside my recent project history and offered a rich rule surface: federal inspection windows, mechanical wear curves, hours-of-service limits, defect documentation, and regulatory policy all intersect in one record. That intersection is what makes multi-agent validation interesting. A single monolithic classifier would need to simultaneously reason about mechanics, regulation, and operations. Splitting agents let me test each concern independently and combine results with traceable provenance.

Closing Thoughts

FleetGuard-AI started as a question about dispatch triage labor, not as a pitch for autonomous fleet management. What I ended up with is a compact LangGraph workflow that feels honest about its limits: deterministic agents, explicit traces, readable tests, and an API layer that could be swapped out without rewriting business logic.

From my experience, the hardest part of multi-agent tutorials is not drawing boxes on a whiteboard. It is choosing a domain where evaluation is grounded. Maintenance records gave me that grounding. When the urgency agent mislabeled an ELD hours violation, tests failed or traces looked wrong immediately. When mechanical heuristics worked, the CLI table matched my own gut ranking.

If you are building your own orchestration experiments, steal the structure more than the rules. Keep schemas strict, keep agents narrow, keep orchestration visible, and keep humans in the loop for emergency outcomes. The repository at https://github.com/aniket-work/FleetGuard-AI is there if you want to run the same path I did and remix it for a completely different operational domain.

One last reflection: agent hype often focuses on autonomy. This project convinced me that disciplined validation is the more interesting problem. Fleet records do not need a chatty assistant; they need reliable sorting, traceable reasoning, and fast handoff to humans when stakes are high. LangGraph helped me express that validation story in code without pretending the PoC is more than an experiment.

I will keep iterating on personal agent projects like this because each one teaches a reusable pattern. FleetGuard-AI's pattern is narrow agents, explicit graph state, honest tests, and dashboards that mirror terminal truth. That combination is what I would carry into the next domain, whatever it ends up being.

Cover animation for readers who prefer visual summaries:

Disclaimer

The views and opinions expressed here are solely my own and do not represent the views, positions, or opinions of my employer or any organization I am affiliated with. The content is based on my personal experience and experimentation and may be incomplete or incorrect. Any errors or misinterpretations are unintentional, and I apologize in advance if any statements are misunderstood or misrepresented.
