The Measurement Imbalance in Agentic AI Evaluation Undermines Industry Productivity Claims
As industry reports claim agentic AI systems deliver double-digit productivity gains and multi-trillion-dollar economic potential, the validity of these claims has become critical for investment decisions, regulatory policy, and responsible technology adoption. This paper demonstrates that current evaluation practices exhibit a systemic imbalance: technical metrics dominate assessments (83% of evaluations), while human-centered (30%), safety (53%), and economic (30%) assessments remain peripheral, and only 15% incorporate both technical and human dimensions. Drawing on case studies from healthcare, finance, and retail in which benchmark-strong systems failed in real-world deployment, we propose a balanced four-axis evaluation model spanning technical, human-centered, temporal, and contextual dimensions, and we call on the research community to realign evaluation practices with deployment realities.