ToolSimulator, an LLM-powered tool-simulation feature in the Strands Evals SDK, provides a way to exercise agents’ integrations with external APIs at scale. It aims to reduce reliance on brittle static mocks and risky live calls by generating realistic simulated tool responses for multi-turn workflows and edge cases.

  • LLM-driven simulations for agents that call external tools
  • Available now as part of the Strands Evals SDK
  • Designed to avoid the risks of live API calls and the brittleness of static mocks

What happened

AWS introduced ToolSimulator, a tool-simulation framework integrated into the Strands Evals SDK that uses large language models to emulate external tool responses. The capability is intended for validating AI agents that interact with third-party services, enabling multi-turn and edge-case testing without issuing real API calls. ToolSimulator is positioned as an alternative to static mocks and direct test calls, letting teams run scalable simulations to exercise agents’ integration logic and error handling.
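To make the pattern concrete, the sketch below shows what LLM-emulated tool responses look like compared with a canned static mock. None of these names come from the Strands Evals SDK; `simulate_tool_response`, `ToolCall`, and the injected `llm` callable are purely illustrative, and a deterministic stand-in replaces the real model so the sketch runs offline.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical illustration of LLM-driven tool simulation.
# These names are NOT Strands Evals APIs; they only show the idea.

@dataclass
class ToolCall:
    name: str
    arguments: dict

def simulate_tool_response(call: ToolCall, llm: Callable[[str], str]) -> dict:
    """Ask an LLM to fabricate a plausible response for a tool call,
    instead of hitting the real API or returning a fixed static mock."""
    prompt = (
        f"You are simulating the external tool '{call.name}'. "
        f"Given the arguments {json.dumps(call.arguments)}, "
        "return a realistic JSON response the real service might produce."
    )
    return json.loads(llm(prompt))

# Deterministic stand-in for a real LLM client, so this runs offline.
def fake_llm(prompt: str) -> str:
    return json.dumps({"status": "ok", "results": []})

response = simulate_tool_response(
    ToolCall(name="search_flights", arguments={"origin": "SEA", "dest": "JFK"}),
    llm=fake_llm,
)
```

Because the simulated service is driven by a prompt rather than hard-coded fixtures, the same harness can vary responses across turns of a conversation, which is what distinguishes this approach from static mocks.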


Why it matters

Testing agents against live services can expose sensitive data, trigger unintended actions, and be costly or slow; static mocks often fail to capture realistic, multi-step interactions. An LLM-driven simulator lets teams exercise complex workflows and failure modes safely and at scale, improving confidence that agents will behave correctly in production. By catching integration bugs and edge cases earlier in the development cycle, ToolSimulator can speed iteration and reduce the risk of shipping agents that mishandle external tool responses.
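The "failure modes" point can be illustrated with a toy harness: a simulated tool is steered to return an edge case (here, a rate-limit error on the first call) so an agent's retry logic can be exercised without touching a real service. All names here (`simulated_tool`, `agent_fetch`) are hypothetical, not Strands Evals APIs.

```python
# Hypothetical sketch: a simulated tool returns an edge-case error first,
# the kind of scenario an LLM simulator could be prompted to produce,
# so the agent's error handling can be tested offline.

def simulated_tool(state: dict) -> dict:
    """Return a rate-limit error on the first call, success afterwards."""
    state["calls"] += 1
    if state["calls"] == 1:
        return {"error": "rate_limited", "retry_after": 1}
    return {"status": "ok"}

def agent_fetch(tool, max_retries: int = 2) -> dict:
    """Toy agent loop: retry when the simulated tool reports an error."""
    state = {"calls": 0}
    resp = tool(state)
    for _ in range(max_retries):
        if "error" not in resp:
            break
        resp = tool(state)
    return resp

result = agent_fetch(simulated_tool)
```

Running the same agent against a static mock that always succeeds would never exercise the retry branch; the simulated edge case is what surfaces that logic in a test.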

What to watch next

Teams will be watching how realistic and reliable the simulated responses are across a variety of tool types and multi-turn scenarios, and how easily ToolSimulator integrates with existing CI/CD and evaluation pipelines in Strands Evals. Expect early adopters to publish examples and benchmarks comparing simulated behavior to live interactions. Also monitor AWS documentation, SDK examples, and community feedback for guidance on configuration, limits, and best practices for combining simulated and live testing in production-readiness workflows.

Source assisted: This briefing began from a discovered source item from AWS Machine Learning Blog.