
Surprisingly enough, it seems some AI agents aren’t quite up to scratch on some basic business tests
Salesforce research finds single-turn tasks see only 58% success, while multi-turn effectiveness drops to 35%
Reasoning models like gemini-2.5-pro tend to outperform lighter models
CRMArena-Pro has proven to be a challenging benchmark

Researchers from Salesforce AI Research have introduced a new benchmark, CRMArena-Pro, which uses synthetic enterprise data to assess LLM agent performance…