Most research-agent demos optimize for the final answer.
That is the least useful place to debug them.
The operational questions show up earlier:
- how the research brief was framed
- what source directions were chosen
- whether the source mix was too narrow
- how the synthesis was assembled
- whether the final report preserved confidence and disagreement
That is why we built open-deep-research-workbench:
https://github.com/Tokvera/open-deep-research-workbench
It is a small Node starter that takes a research brief and turns it into:
- a research plan
- source directions
- a citation-aware synthesis
- recommended next steps
- one Tokvera root trace for the whole workflow
Why this is a better starting point than a flashy research demo
A final answer can look polished even when the workflow behind it is weak.
That is why teams need workflow-level visibility for research agents.
This starter keeps the work inside one root trace:
research brief
-> plan_research
-> collect_sources
-> synthesize_report
-> return report + citations
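To make the flow above concrete, here is a minimal sketch of running the three steps under one root trace. It does not use the Tokvera SDK: `createRootTrace`, `step`, and the span fields are hypothetical names for an in-memory stand-in, and the step bodies are placeholders rather than the starter's real logic.

```javascript
// Hypothetical in-memory tracer. This is NOT the Tokvera SDK API; it only
// illustrates keeping every step of the workflow under one root trace.
let traceCounter = 0;

function createRootTrace(name) {
  traceCounter += 1;
  const trace = { traceId: `trace-${Date.now()}-${traceCounter}`, name, spans: [] };

  // Run one step and record its input, output, status, and duration.
  async function step(stepName, input, fn) {
    const span = { traceId: trace.traceId, step: stepName, input, status: "running" };
    trace.spans.push(span);
    const startedAt = Date.now();
    try {
      span.output = await fn(input);
      span.status = "ok";
      return span.output;
    } catch (err) {
      span.status = "error";
      span.error = String(err);
      throw err;
    } finally {
      span.durationMs = Date.now() - startedAt;
    }
  }

  return { trace, step };
}

// Placeholder step bodies standing in for plan_research, collect_sources,
// and synthesize_report. The real starter would call a model here.
async function runResearch(brief) {
  const { trace, step } = createRootTrace("research_brief");

  const plan = await step("plan_research", brief, async (b) => ({
    questions: b.goals.map((goal) => `What is known about: ${goal}?`),
  }));

  const sources = await step("collect_sources", plan, async (p) =>
    p.questions.map((question, i) => ({ id: `src-${i + 1}`, question }))
  );

  const report = await step("synthesize_report", sources, async (s) => ({
    summary: `Synthesized ${s.length} source direction(s).`,
    citations: s.map((source) => source.id),
  }));

  return { report, trace };
}
```

Because every span carries the same `traceId`, the plan, the source directions, and the synthesis can be read back in order from `trace.spans` instead of being reconstructed from the final report.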
Stack
- Node.js
- Express
- OpenAI
- Tokvera JavaScript SDK
- Zod
Mock mode is enabled by default, so it is easy to run locally.
Quick start
git clone https://github.com/Tokvera/open-deep-research-workbench.git
cd open-deep-research-workbench
npm install
# macOS/Linux shown; on Windows (cmd) use: copy .env.example .env
cp .env.example .env
npm run dev
The server starts on http://localhost:3400.
Endpoints
- GET /health
- GET /api/demo-brief
- GET /api/sample-briefs
- POST /api/research
Example request
curl -X POST http://localhost:3400/api/research \
-H "Content-Type: application/json" \
-d '{
"topic": "How engineering teams should evaluate coding agents before letting them open pull requests",
"audience": "Platform and application engineering leads",
"goals": [
"Find the main reliability and review concerns around coding agents",
"Collect practical examples of evaluation workflow design",
"Summarize what observability signals matter before production rollout"
],
"timeframe": "current developer guidance"
}'
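The same request can be sent from Node. This is a rough sketch that assumes Node 18 or later (for the global `fetch`) and the server running on the default port. `buildResearchRequest` and `submitBrief` are hypothetical helper names, and the response shape is not documented here, so the sketch returns the parsed JSON as-is.

```javascript
// Build the URL and fetch options for POST /api/research.
// Kept separate from the network call so the payload can be inspected or tested.
function buildResearchRequest(brief, baseUrl = "http://localhost:3400") {
  return {
    url: `${baseUrl}/api/research`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(brief),
    },
  };
}

// Send the brief and return the parsed JSON body.
// Assumption: the endpoint answers with JSON; check the actual response before relying on fields.
async function submitBrief(brief, baseUrl) {
  const { url, options } = buildResearchRequest(brief, baseUrl);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Research request failed with status ${response.status}`);
  }
  return response.json();
}
```

A call would look like `submitBrief({ topic, audience, goals, timeframe })`, using the same four fields as the curl example.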
Why the root trace matters
Research-agent failures are usually lineage failures.
The brief may be weak.
The source directions may be too narrow.
The synthesis may flatten disagreement.
Without one root trace, you can only argue about the final answer.
With one root trace, you can inspect where the workflow drifted.
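As one illustration of that kind of inspection, the sketch below checks whether the source mix was too narrow by counting distinct hosts in the `collect_sources` step. The span shape (`{ step, output }` with each source carrying a `url`) and the "more than half from one host" threshold are assumptions made for this example, not the starter's or Tokvera's actual data model.

```javascript
// Hypothetical span shape: { step: string, output: any }.
// A real trace may expose different fields; adapt the accessors accordingly.
function summarizeSourceMix(spans) {
  const collect = spans.find((span) => span.step === "collect_sources");
  const sources = collect && Array.isArray(collect.output) ? collect.output : [];

  // Count how many sources come from each host.
  const counts = new Map();
  for (const source of sources) {
    const host = new URL(source.url).hostname;
    counts.set(host, (counts.get(host) || 0) + 1);
  }

  return {
    totalSources: sources.length,
    distinctHosts: counts.size,
    // Illustrative threshold: flag the mix when one host supplies more than half the sources.
    dominatedByOneHost: [...counts.values()].some((n) => n > sources.length / 2),
  };
}
```

A flagged result points the review at source collection rather than at the wording of the final report.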
Useful follow-up links
- Repo:
- Website post:
- Multi-step workflow page:
- Agent workflow debugging: