xRead - Incorporating Artificial Intelligence into Clinical Practice (March 2026)


Ng et al. BMC Medical Informatics and Decision Making

(2025) 25:236


Fig. 1 PRISMA flowchart showing the study selection process

approaches to capture both performance metrics and user perspectives [23, 28, 33–36]. These studies spanned diverse environments including emergency departments (EDs), inpatient wards, specialty outpatient clinics (e.g., gastroenterology, urology, dermatology) and simulated clinical scenarios.

Key findings

Accuracy and error rates

While some systems demonstrate impressive precision and recall (Happe et al. achieved a precision of 0.73 and recall of 0.90 in a specialized vocabulary environment [10], and Suominen et al. reached F1 scores of up to 0.856 for nursing tasks [15]), other studies highlight notable shortcomings. For instance, Lybarger et al. reported a much lower maximum F1 score of 0.49 [18], and Zhou et al. found an F1 score of 0.416 in nursing contexts despite real-world training data [19]. Similarly, WER ranged from as low as 0.087 in controlled scenarios (Issenman et al. [12]) to more than 2.9 in real-time, multi-specialty outpatient encounters (Biro et al. [32]). van Buchem et al. demonstrated modest ROUGE F1 scores (0.32 unedited vs. 0.41 human-edited) for automated summaries [31]; the fact that human editing still improved these outputs underscores the potential but also the current limitations of LLM-driven summarization.
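The accuracy figures cited here rest on a few standard definitions. As a minimal sketch (not the evaluation code of any cited study), the Python below computes F1 as the harmonic mean of precision and recall, and WER as word-level edit distance divided by the number of reference words. The function names and the sample sentences are hypothetical, chosen only for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words (Levenshtein, row by row).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Precision and recall as reported for Happe et al. [10].
    print(round(f1_score(0.73, 0.90), 3))

    # Hypothetical transcript pair: five inserted words, four reference words.
    reference = "patient denies chest pain"
    hypothesis = "the patient uh denies any chest pain today okay"
    print(word_error_rate(reference, hypothesis))
```

Under these definitions, a precision of 0.73 and a recall of 0.90 correspond to an F1 of about 0.81. Because inserted words count as errors while the denominator is the reference length, WER is not capped at 1; the hypothetical pair above yields 1.25.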

Workflow efficiency and time savings

Results on workflow efficiency were mixed. While Zick et al. [9] and Issenman et al. [12] observed decreased turnaround time (from days to hours or minutes), Hodgson et al. [16] and Blackley et al. [21] found that post-editing often negated potential time gains. More recent LLM-based systems (e.g., Bundy et al. [23], Ma et al. [34]) claim to shorten overall documentation time for certain specialties, but these claims often relied on small sample sizes or single-site studies, limiting generalizability.

Cost implications

Cost-effectiveness was inconclusive across the included studies. Early work in EDs (Zick et al. [9]) suggested significant cost savings with ASR, whereas Issenman et al. [12] found that voice recognition could be twice as expensive in pediatric gastroenterology. These differences highlight how cost can vary based on clinical setting, complexity of cases and existing staffing models.

Clinical documentation quality and patient care

Studies such as Almario et al. [13, 14] showed that AI-assisted documentation captured more clinically relevant red flags than physician-typed notes. Other research (e.g., Kodish-Wachs et al. [17], Sezgin et al. [30]) observed that high error rates or poor summarization fidelity could pose risks to patient safety, especially if omissions go unnoticed by overburdened clinicians.
