xRead - Incorporating Artificial Intelligence into Clinical Practice (March 2026)

Ng et al. BMC Medical Informatics and Decision Making

(2025) 25:236

Page 19 of 24

Notably, more recent digital scribes and LLM-based solutions are designed to create structured summaries (e.g., SOAP notes), although these still generally require human review for accuracy.

Clinician satisfaction, burnout and adoption

Goss et al. [20] identified higher satisfaction when clinicians encountered fewer transcription errors and minimal editing demands. Misurac et al. [28] and Shah et al. [36] further reported decreased burnout levels following the adoption of AI tools, indicating a potential benefit for clinician well-being. Despite improved satisfaction in some quarters, many clinicians expressed reluctance to rely fully on AI scribing, citing concerns about real-time error correction and the need for manual review (e.g., Bundy et al. [23], Moryousef et al. [35]).

Risk of bias and study quality

Most of the included studies were assessed to have a low risk of bias and low risk of applicability concerns, as shown in Table 3 and illustrated in Figs. 2 and 3. For studies identified to have a moderate or high risk of bias and applicability concerns, the greatest contributing factor was patient selection, followed by index test. This was possibly due to some studies having unclear patient selection criteria, and some studies having controlled test environments, which may bias the index tests.

The QUADAS-2 assessment of the included studies showed predominantly low risk of bias and applicability concerns, supporting the reliability of findings on LLMs as transcription tools in medical domains (Figs. 2 and 3). However, some studies, such as Zick et al. [9], Blackley et al. [21], Hodgson et al. [16], and Kodish-Wachs et al. [17], had high risk of bias in patient selection and applicability concerns, which may limit generalizability. Unclear reference standards in studies like Bundy et al. [23] and Goss et al. [20] suggest potential gaps in validation. While the overall low risk in flow and timing strengthens confidence in the results, variability in methodological rigor underscores the need for standardised evaluation for future studies to ensure consistent and reliable conclusions.

Discussion

Our review identified 29 studies that investigated the applications of ASR and NLP in medical transcriptions across a variety of clinical settings. The included studies spanned environments such as EDs, inpatient wards, specialized clinics (e.g., gastroenterology, psychiatry, and endocrinology) and even simulated scenarios replicating ambulatory primary care workflows. Owing to this diversity and the significant heterogeneity in study designs, sample sizes and performance metrics, making direct comparisons was challenging. Nonetheless, the findings underscore the wide-ranging potential of AI-based

Table 2 (continued)

| Study Reference | Comparator Standard | Type | Subcategories | Performance Metric (F1 score, Precision, Recall, WER) | AI Transcription Proficiency (paper specific outcomes) | Key Findings | Novel Features |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Moryousef et al., 2025 [35] | Standardized consultation notes | Multiple AI scribes | Accuracy and usability | NR | Nabla had the highest accuracy (68%) | 75% of urologists found AI scribes useful; concerns about accuracy remain. | Assessed multiple AI scribe platforms for clinical documentation |
| Shah et al., 2025 [36] | NR | Pre-intervention versus post-intervention | Physician burnout and usability | NR | Burnout decreased significantly (-1.94 points) | AI documentation improved perceptions of efficiency and reduced task load. | Assessed usability using Stanford physician survey |

Abbreviations: ADS: Ambient Digital Scribe; CPT: Current Procedural Terminology; DAX: Dragon Ambient eXperience; EHR: electronic health record; EMR: electronic medical record; HPI: History of presenting illness; IQR: Interquartile range; KBM: keyboard and mouse; NR: not reported; LSTM: Long Short-Term Memory; SD: standard deviation; SR: speech recognition; VRS: voice recognition software; WER: word error rate

* indicates statistical significance

† Adapted scoring system for HPI and Assessment & Plan (AP): Clarity (of HPI and AP), completeness (of HPI and AP), concision (of HPI and AP), sufficiency (of HPI and AP), prioritisation (of AP)
