OpenReview survey · source date 2026-05-14
1. Problems / challenges / motivations
- The paper argues that real-world LLM-agent reliability is often constrained less by the base model than by the execution harness around it: environment, tools, context, orchestration, observability, evaluation, and governance.
- Prompt engineering and context engineering are no longer enough for production agents....
engineering postmortem · source date 2026-04-23
1. Problems / challenges / motivations
- Anthropic describes Claude Code quality regressions caused by product-layer changes rather than a simple base-model failure.
- Changes to reasoning effort, caching, and prompt instructions affected user experience in ways internal evals did not initially reproduce.
- This exposes a common production-eval gap: offline...
engineering blog · source date 2026-02-18
1. Problems / challenges / motivations
- Production agents fail in ways that final-answer evals do not explain: wrong tool choice, weak memory retrieval, multi-step drift, brittle recovery, or incomplete task execution.
- Black-box LLM scoring is insufficient when agent behavior depends on orchestration, tools, business rules, and runtime context.
- Large...