Great wrap-up—closing the loop with LangFuse and RAGAS makes the whole “agent in the wild” story feel operational instead of aspirational, and the rubrics for recommendations plus the AspectCritic bits (especially tool usage effectiveness) are genuinely practical for a restaurant domain.

One small eyebrow-raiser: using the same model as the agent for LLM-as-a-judge can be like letting the chef grade their own soup—probably delicious, but blind spots travel in pairs; a different evaluator model or periodic human spot checks could help calibrate scores and catch shared quirks. The binary definitions for some AspectCritic metrics are crisp but might be a tad coarse—have you tried graded scales or confidence scores to avoid overly jumpy dashboards and to support trend analysis?

Also curious how you prevent label leakage when extracting eval sets from LangFuse traces, and whether you batch or cache evaluator calls to keep cost/latency in check for large test runs. The score pushback into LangFuse is a nice touch for A/B and quality gates; a quick example of alert thresholds or a “bad day” dashboard would make it even more actionable.

Tiny nit: I think “Atrands Agents” in the resources wants to be “Strands Agents.” Overall, super helpful and grounded—now I'm hungry for both better evals and a menu that actually has what I asked for.
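
P.S. For the separate-judge and quality-gate points, here is roughly the shape I had in mind. A loose sketch, not the post's actual setup: the evaluator model, the AspectCritic definition, the trace id, and the alert threshold are all placeholders, and the LangFuse score call may be named differently (e.g. `create_score`) depending on your SDK version.

```python
import asyncio

from langchain_openai import ChatOpenAI
from langfuse import Langfuse
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import AspectCritic

# Judge with a model that is *not* the agent's model (model choice is illustrative).
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

tool_usage = AspectCritic(
    name="tool_usage_effectiveness",
    definition=(
        "Did the agent call the appropriate tools with sensible arguments "
        "to answer the user's request?"
    ),
    llm=evaluator_llm,
)

# One sample pulled from a trace; in practice you'd loop over the exported eval set.
sample = SingleTurnSample(
    user_input="Do you have any vegan mains tonight?",
    response="Yes, the grilled cauliflower steak and the mushroom risotto are both vegan.",
)
score = asyncio.run(tool_usage.single_turn_ascore(sample))  # AspectCritic yields 0 or 1

# Push the score back onto the originating trace (newer Langfuse clients use
# create_score instead of score) and flag it against a simple quality gate.
langfuse = Langfuse()
trace_id = "<trace-id-from-the-exported-trace>"  # placeholder
langfuse.score(trace_id=trace_id, name="tool_usage_effectiveness", value=score)

ALERT_THRESHOLD = 0.8  # illustrative: alert when the mean across recent traces dips below this
if score < ALERT_THRESHOLD:
    print(f"tool_usage_effectiveness below threshold for trace {trace_id}: {score}")
```

In a real run you'd batch this over the whole exported eval set (RAGAS's `evaluate()` helper does that) and ideally cache judge outputs, which is where my cost/latency question comes from.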