xRead - Incorporating Artificial Intelligence into Clinical Practice (March 2026)
Ng et al. BMC Medical Informatics and Decision Making
(2025) 25:236
transcription accuracy (measured through Word Error Rate, or WER), time savings, clinician satisfaction or the impact on patient care. The review included empirical studies of various designs, including randomized controlled trials (RCTs), cohort studies, cross-sectional studies, comparative evaluations and proof-of-concept studies. Only studies published in English or with an English translation, and indexed up to February 16, 2025, were considered. Studies that did not involve AI-based transcription tools were excluded. Also excluded were studies conducted in clinical settings that did not involve a physician facing a patient, such as laboratory-based evaluations and reports generated by radiologists and/or pathologists, as well as studies focusing on non-English language transcription. Additionally, conference abstracts, editorials, commentaries and opinion pieces that did not provide empirical data were excluded.

Study selection

All identified studies were imported into Covidence (Veritas Health Innovation, Melbourne, Australia) to facilitate the screening process. Four independent reviewers (J.J.W.N., E.W., C.X.L.G. and G.Z.N.S.) screened titles and abstracts to exclude studies that did not meet the inclusion criteria. Studies passing this initial screening underwent a full-text review by two independent reviewers (J.J.W.N. and E.W.). Discrepancies in study inclusion were resolved through discussion, with arbitration by a third, senior reviewer (Q.X.N., H.K.T. or S.S.N.G.) if necessary.

Data extraction

Data were extracted from each included study using a standardized data extraction form developed for this review. The extracted data encompassed study characteristics, including the software or model used, the type of AI model, study design, clinical setting and country in which the study was conducted.
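
Word Error Rate, mentioned above as the measure of transcription accuracy, is conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the AI-generated transcript, divided by the number of words in the reference. The Python sketch below illustrates that conventional definition only. It is not taken from this review or from any included study, and the function name, the whitespace tokenization and the example sentences are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Conventional WER: word-level edit distance / number of reference words.

    Assumption: words are obtained by splitting on whitespace; real
    evaluations usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word present in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution ("chest" -> "chess") and one
# deletion ("today") against a five-word reference.
example_wer = word_error_rate(
    "patient denies chest pain today",
    "patient denies chess pain",
)
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words, which is worth keeping in mind when comparing values reported by different studies.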
Additionally, details about the study population, sample size, whether the study was vendor-initiated, the reference standard and the comparator type were also recorded. Key performance metrics, such as F1 score, precision, recall, and WER, along with paper-specific outcomes related to AI transcription proficiency and key findings, were also included. Novel features of the AI transcription tools were documented to provide a comprehensive overview of the studies.

Assessment of risk of bias and study quality

The Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool [7] was used to systematically assess the risk of bias and applicability of the studies included in our review. QUADAS-2 is a widely used tool that
evaluates risk of bias across four key domains: patient selection, index test, reference standard, and flow and timing [7]. Each domain was independently assessed for potential bias by reviewing key study characteristics and determining whether the conduct or interpretation of results could have introduced bias. We also evaluated whether the applicability of each study matched our review question. For each domain, the risk of bias was rated as either “low,” “high,” or “unclear,” depending on the completeness and clarity of the study’s reported methods. Specifically, we looked at factors such as the selection of patients or datasets, how the index test was performed and interpreted, whether the reference standard was appropriate, and whether there were any exclusions that could have influenced the results. Any discrepancies were resolved through discussion among the reviewers (E.W., J.J.W.N. and X.Z.). This careful assessment ensured that the studies included in the analysis were reliable and applicable to our research objectives.

Data synthesis

Because anticipated heterogeneity in study design, interventions and outcomes made a meta-analysis infeasible, a narrative synthesis was conducted, as guided by Popay et al. [8]. Findings were narratively synthesized by summarizing key outcomes such as transcription accuracy, clinician satisfaction, impact on patient care, and usability, and by identifying common patterns across the studies.

Results

Literature retrieval

A total of 5,244 records were initially identified through database searches. After removing 1,011 duplicates using Covidence, 4,233 studies were screened based on titles and abstracts. During this screening phase, 4,173 studies were excluded for not meeting the inclusion criteria, leaving 60 studies for full-text review. All of these studies were retrieved for detailed assessment. After applying the inclusion and exclusion criteria, 25 studies were included. As illustrated in Fig.
1, an additional four studies were identified through forward and backward citation searching, bringing the final total to 29 studies. The key study characteristics and findings of the 29 studies [4, 9–36] are summarized in Tables 1 and 2, respectively.

Study results

Table 1 provides an overview of each study’s design, setting, participant information and AI transcription tools, and indicates whether a vendor was involved. Study designs ranged from RCTs [11, 16] to comparative or observational studies [9, 10, 12, 20, 21], with a growing number of recent publications employing qualitative or pre-post