Projects
The Doctor Will (Still) See You Now: On the Structural Limits of Agentic AI in Healthcare
A qualitative interview study with 20 stakeholders examining how agentic AI is defined, evaluated, and constrained in healthcare, identifying three mutually reinforcing tensions: conceptual fragmentation, an autonomy contradiction, and an evaluation blind spot.
Expert Evaluation and the Limits of Human Feedback in Mental Health AI Safety Testing
A mixed-methods study examining inter-rater reliability among three psychiatrists evaluating 360 LLM-generated mental health responses, revealing systematic expert disagreement driven by incompatible clinical frameworks rather than measurement error.
Responsible AI in the Global Context
A global survey-based study exploring responsible AI practices across 1,000 organizations spanning 20 industries and 19 regions, defining a conceptual RAI maturity model.
The Measurement Imbalance in Agentic AI Evaluation Undermines Industry Productivity Claims
A systematic review of 84 papers (2023–2025) exposing an evaluation imbalance in agentic AI, where technical metrics dominate (83%) while human-centered, safety, and economic dimensions remain peripheral, with a proposed four-axis evaluation framework.
More than Marketing? On the Information Value of AI Benchmarks for Practitioners
A qualitative interview study with 19 practitioners in academia, product, and policy examining how AI benchmarks are used to inform decision-making, finding that benchmarks serve as relative performance indicators but often lack the real-world relevance needed for substantive deployment decisions.
LeRAAT: LLM-Enabled Real-Time Aviation Advisory Tool
A real-time advisory system leveraging large language models to assist aviation professionals with decision-making during complex operational scenarios.