xRead - Incorporating Artificial Intelligence into Clinical Practice (March 2026)


BMC Medical Informatics and Decision Making

Ng et al. BMC Medical Informatics and Decision Making (2025) 25:236 https://doi.org/10.1186/s12911-025-03061-0

Open Access

SYSTEMATIC REVIEW

Evaluating the performance of artificial intelligence-based speech recognition for clinical documentation: a systematic review

Joel Jia Wei Ng 1† , Eugene Wang 1† , Xinyan Zhou 1† , Kevin Xiang Zhou 2† , Charlene Xing Le Goh 1 , Gabriel Zheng Ning Sim 1 , Hiang Khoon Tan 3,4,5 , Serene Si Ning Goh 1,6,7 and Qin Xiang Ng 4,7*

Abstract

Background Clinical documentation is vital for effective communication, legal accountability and the continuity of care in healthcare. Traditional documentation methods, such as manual transcription, are time-consuming, prone to errors and contribute to clinician burnout. AI-driven transcription systems utilizing automatic speech recognition (ASR) and natural language processing (NLP) aim to automate and enhance the accuracy and efficiency of clinical documentation. However, the performance of these systems varies significantly across clinical settings, necessitating a systematic review of the published studies.

Methods A comprehensive search of MEDLINE, Embase, and the Cochrane Library identified studies evaluating AI transcription tools in clinical settings, covering all records up to February 16, 2025. Inclusion criteria encompassed studies involving clinicians using AI-based transcription software, reporting outcomes such as accuracy (e.g., Word Error Rate), time efficiency and user satisfaction. Data were extracted systematically, and study quality was assessed using the QUADAS-2 tool. Due to heterogeneity in study designs and outcomes, a narrative synthesis was performed, with key findings and commonalities reported.

Results Twenty-nine studies met the inclusion criteria. Reported word error rates ranged widely, from 0.087 in controlled dictation settings to over 50% in conversational or multi-speaker scenarios. F1 scores spanned 0.416 to 0.856, reflecting variability in accuracy. Although some studies highlighted reductions in documentation time and improvements in note completeness, others noted increased editing burdens, inconsistent cost-effectiveness and persistent errors with specialized terminology or accented speech. Recent LLM-based approaches offered automated summarization features, yet often required human review to ensure clinical safety.
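For readers unfamiliar with the Word Error Rate metric cited above, the following is a minimal illustrative sketch of how it is commonly computed: the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the ASR output, divided by the number of reference words. It is not taken from the review or from any included study. The function name `word_error_rate`, the lowercasing and the whitespace tokenization are assumptions made here for illustration, and the example sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: text is lowercased and split on whitespace; real
    evaluations often apply more elaborate normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missed)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one substituted word out of eight reference words.
reference = "patient denies chest pain or shortness of breath"
hypothesis = "patient denies chest pain and shortness of breath"
print(word_error_rate(reference, hypothesis))  # 0.125
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference, so a WER is not bounded in the way a percentage accuracy is.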

† Joel Jia Wei Ng, Eugene Wang, Xinyan Zhou and Kevin Xiang Zhou contributed equally and should be considered as co-first authors. *Correspondence: Qin Xiang Ng ng.qin.xiang@u.nus.edu Full list of author information is available at the end of the article

© The Author(s) 2025. Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
