Every factory has two kinds of knowledge. There's the kind that lives in binders — official manuals, error code sheets, maintenance schedules. And then there's the other kind: the stuff that lives in people's heads. The fix that only works if you tap the relay housing first. The paint sprayer that always throws a TEMP_OVERRUN_03 when the morning crew forgets to flush the line. The robot arm that needs a specific reboot sequence that was figured out three years ago by a technician who no longer works there.
This second kind of knowledge — tribal knowledge — is the thing that actually keeps a factory running. And it disappears constantly. Every retired technician, every shift handoff, every new hire learning on the job from scratch is a leak in the system. I built Shadow Floor Engineer because I got tired of watching that knowledge evaporate.
What the System Does and How It Hangs Together
Shadow Floor Engineer is a locally-deployed AI agent that acts as a conversational interface for factory floor shift leads and technicians. Instead of handing someone a laminated error-code sheet when a machine fails, they open a browser tab and describe what's happening. The agent asks clarifying questions, surfaces relevant fixes — including ones pulled from previous shifts — and actively prompts the technician to document what actually worked in the chat.
That last part is the core of it. The agent isn't just answering questions. It's continuously absorbing new institutional memory. Every fix a technician types into the chat gets retained in a cloud memory graph. The next shift lead who hits the same error on the same machine gets the benefit of everything the previous team learned.
The stack is deliberately minimal:
• Flask (Python) as the backend API layer
• Groq Cloud API for fast LLM inference
• Hindsight Global Memory (Vectorize.io) as the persistent memory layer — specifically the factory1 memory bank
• A lightweight HTML/Vanilla JS frontend with glassmorphism styling — chosen so it runs on a factory floor tablet without any build toolchain
The Core Problem: Cold Starts and Knowledge Rot
The naïve version of this project would have been simple: hook a chat UI to an LLM and call it done. That version has a fatal flaw. A generic language model knows what a HYDRAULIC_FAULT_07 means in textbook terms. It does not know that on this specific press in this facility, that fault code almost always means the left side pressure sensor needs to be reset before the hydraulic line is pressurized — a quirk that was discovered the hard way after two hours of diagnostic dead-ends on a night shift.
So the system needed real memory. Not just context window memory — full persistent memory that accumulates across sessions, shifts, and weeks. That's where Hindsight comes in. Hindsight provides an agent memory layer that handles both retention (storing observations) and recall (retrieving relevant past context). I used it to maintain a continuously evolving knowledge graph specifically scoped to the factory1 memory bank.
The system handles two distinct situations. The first is a cold start — a query about a machine or error the agent has never encountered before. In that case, it falls back to the LLM's base training and gives a standard diagnostic response. But critically, it then prompts the technician explicitly: "What did you actually try? What ended up working?" That response isn't just a conversational reply. It immediately triggers a hindsight.retain() call and gets written to the memory graph.
The second situation is a warm start — a query about something the agent has seen before. Here, the Hindsight recall layer surfaces relevant past fixes before the LLM even generates a response, which means the agent's first answer to a returning problem already incorporates what worked last time. The difference in quality is noticeable.
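The cold/warm branching can be sketched as a small routing function. The result shape (a list of dicts with `content` and `score`) and the score threshold are assumptions for illustration, not Hindsight's actual response format:

```python
def build_prompt(user_message, recalled, score_threshold=0.5):
    """Route between cold and warm starts based on recall results.

    `recalled` is a list of {'content': str, 'score': float} dicts --
    an assumed shape, not the real Hindsight response format.
    """
    relevant = [m for m in recalled if m.get('score', 0) >= score_threshold]
    if not relevant:
        # Cold start: no usable memory. Fall back to base-model
        # diagnostics and explicitly ask for the eventual fix.
        return (
            "You are a factory floor assistant. No prior fixes are on "
            "record for this issue. Give standard diagnostics, then ask "
            "the technician what they tried and what actually worked.\n\n"
            f"Technician: {user_message}"
        )
    # Warm start: prepend what worked before, so the first answer
    # already incorporates prior shifts' knowledge.
    memory_context = "\n".join(m["content"] for m in relevant)
    return (
        "You are a factory floor assistant. Prior fixes on record:\n"
        f"{memory_context}\n\nTechnician: {user_message}"
    )
```

The key design point is that the routing decision happens before the LLM call, so a warm start never wastes the model's first response on generic diagnostics.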
Inside the Backend: Retain and Recall
The backend logic in backend.py is straightforward by design. Every incoming technician message goes through a two-step pipeline:
Step one — recall. Before constructing the prompt for Groq, the system queries Hindsight for relevant prior observations:
```python
# Retrieve relevant past fixes from Hindsight memory
recalled = hindsight_client.recall(
    query=user_message,
    bank='factory1',
    top_k=5
)

# Inject recalled context into the system prompt
memory_context = '\n'.join([m['content'] for m in recalled])
```
Step two — retain. Every technician response that describes an actual fix is written back to the memory graph:
```python
# After receiving the technician's real-world fix
hindsight_client.retain(
    content=technician_response,
    bank='factory1',
    metadata={
        'machine': detected_machine_id,
        'error_code': detected_error_code,
        'shift': current_shift_id,
    }
)
```
The metadata tagging is important. When the recall query runs, the Hindsight memory layer can use the machine ID and error code to surface the most contextually relevant past observations rather than just the most semantically similar text. A TEMP_OVERRUN_03 on Robot Arm 2 should recall fixes for Robot Arm 2, not generic temperature error responses from the paint shop.
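One way to derive the `detected_machine_id` and `detected_error_code` tags is simple pattern matching on the technician's message. The regexes and naming conventions below are illustrative assumptions, not the project's actual detection logic:

```python
import re

# Assumed conventions: error codes look like NAME_NN (e.g. TEMP_OVERRUN_03),
# and machines are referenced as "<name> <number>" (e.g. "Paint Sprayer 3").
ERROR_CODE_RE = re.compile(r'\b[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)+\b')
MACHINE_RE = re.compile(
    r'\b(paint sprayer|robot arm|press)\s+(\d+)\b', re.IGNORECASE)


def extract_tags(message):
    """Pull a machine ID and error code out of a technician message."""
    tags = {}
    if (m := ERROR_CODE_RE.search(message)):
        tags['error_code'] = m.group(0)
    if (m := MACHINE_RE.search(message)):
        # Normalize "Paint Sprayer 3" -> "paint_sprayer_3"
        tags['machine'] = f"{m.group(1).lower().replace(' ', '_')}_{m.group(2)}"
    return tags
```

A production version would likely validate these against a known equipment list rather than trusting free-text matches.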
The initial seeding is handled by a separate script, seed_hindsight_memory.py, which populates the factory1 bank with the baseline operational data for each piece of equipment before the agent goes live. This gives the agent a non-empty starting point — it knows the machines exist and has some foundational context — but the real value accumulates from actual shift interactions over time.
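The seeding pattern is straightforward; a minimal sketch looks like the following, using an in-memory stub in place of the real Hindsight client and invented seed records for illustration:

```python
class StubHindsightClient:
    """In-memory stand-in for the real Hindsight client, used here
    only to illustrate the seeding pattern."""

    def __init__(self):
        self.banks = {}

    def retain(self, content, bank, metadata=None):
        self.banks.setdefault(bank, []).append(
            {'content': content, 'metadata': metadata or {}})


# Hypothetical baseline records -- real seed data would come from
# the facility's equipment manuals and maintenance schedules.
SEED_RECORDS = [
    ("Paint Sprayer 3: nominal operating temp 40-60C; flush the line "
     "at the start of each morning shift.",
     {'machine': 'paint_sprayer_3'}),
    ("Robot Arm 2: requires full reboot sequence after any E-stop.",
     {'machine': 'robot_arm_2'}),
]


def seed_memory(client, bank='factory1'):
    """Populate the bank with baseline context before go-live."""
    for content, metadata in SEED_RECORDS:
        client.retain(content=content, bank=bank, metadata=metadata)
```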
What It Actually Looks Like in Practice
A concrete example of how a session flows:
A technician opens the UI and types: "Paint Sprayer 3 is throwing TEMP_OVERRUN_03 again, we already checked the thermocouple". The agent does a recall query against the factory1 memory bank. If this error has been seen before, it surfaces the relevant prior fix — in this case, that the TEMP_OVERRUN_03 on Sprayer 3 is almost always caused by a clogged coolant bypass valve, not a sensor fault. The technician clears the valve, the machine comes back online, and the agent asks what the resolution was. The technician confirms the valve fix, and that response is retained with the sprayer's metadata.
The next time anyone asks about TEMP_OVERRUN_03 on Sprayer 3, that fix is the first thing the agent surfaces — before generic diagnostics, before the textbook response. That's agent memory doing what it's supposed to do: making the agent measurably smarter about your specific environment the more it's used.
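The retain-then-recall loop above can be simulated end to end with a toy in-memory bank. The keyword-overlap ranking here is an illustrative stand-in, far cruder than Hindsight's actual semantic recall:

```python
class ToyMemoryBank:
    """Sketch of retain/recall across shifts. Real recall is semantic;
    this toy version just ranks stored records by keyword overlap."""

    def __init__(self):
        self.records = []

    def retain(self, content, metadata=None):
        self.records.append({'content': content, 'metadata': metadata or {}})

    def recall(self, query, top_k=5):
        q = set(query.lower().split())
        scored = sorted(
            self.records,
            key=lambda r: len(q & set(r['content'].lower().split())),
            reverse=True,
        )
        return scored[:top_k]


bank = ToyMemoryBank()

# Shift 1: the technician documents the real fix.
bank.retain(
    "TEMP_OVERRUN_03 on Paint Sprayer 3: clogged coolant bypass valve, "
    "clearing it fixed the overrun",
    metadata={'machine': 'paint_sprayer_3', 'error_code': 'TEMP_OVERRUN_03'},
)

# Shift 2: the same error recalls the prior fix first.
hits = bank.recall("TEMP_OVERRUN_03 Paint Sprayer 3")
```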
Lessons Learned
A few things worth taking away from this build:
- The recall step is more valuable than the generate step. I initially underestimated how much the quality of the agent's responses would improve once Hindsight had a few weeks of real interactions in it. The LLM's role shifted from "source of truth" to "reasoner over retrieved evidence." That's the right relationship between a language model and institutional memory.
- Metadata tagging on retention is non-negotiable. Early versions retained observations without machine or error code tags. The recall results were semantically related but operationally useless — fixes for one machine surfaced for completely different equipment. Structured metadata on every retain() call made the recall results dramatically more precise.
- Cold starts are a feature, not a bug. The explicit fallback to base model knowledge, combined with the active prompt to document what actually worked, turns every unknown error into a learning event. The system gets better from its own gaps.
- Keep the frontend dumb. Vanilla JS and a single HTML file means this runs on anything with a browser. No Node, no build step, no dependencies. On a factory floor, simplicity is a feature.
- Seeding matters for day-one confidence. Without the initial seed data, the agent's first interactions feel hollow — it knows nothing specific about the facility's machines. Running seed_hindsight_memory.py before deployment gives it enough baseline context that the first shifts are productive rather than frustrating.
Where This Goes Next
The obvious next step is multi-facility support — extending the memory model so that fixes from one plant can optionally propagate to related facilities running the same equipment. The Hindsight memory architecture is designed to support this kind of scoped sharing, so the infrastructure is there.
Beyond that, the retention pipeline could be extended to capture passive observations — maintenance logs, sensor readings, shift reports — rather than relying solely on explicit technician input in the chat. The goal is to make the agent's memory as continuous as the factory floor itself.
Most AI tools get smarter by retraining on new data. Shadow Floor Engineer gets smarter by remembering what worked last Tuesday. That's a different kind of intelligence, and for a manufacturing environment, it's the kind that actually matters.
The project repository is available on GitHub. The memory layer is built on Hindsight by Vectorize.io. If you're thinking about agent memory for a domain with deep institutional knowledge, the pattern used here — retain on every technician response, recall before every generation — is a solid starting point.