xRead - Incorporating Artificial Intelligence into Clinical Practice (March 2026)
Ng et al. BMC Medical Informatics and Decision Making
(2025) 25:236
Automatic summaries had higher word count and lower lexical diversity.
Key Findings Novel Features
Used both open-label and masked study designs
Two-stage evaluation: ROUGE metric comparison and manual annotation for information recall accuracy
Collaboration between the system and students leads to the best results, with a
decrease in time spent on summarising in combination with a similar quality when compared to manual summarisation.
Study was independently conducted and focused on real-world usability and safety concerns rather than just technical accuracy.
AI tools require continuous independent testing: when tested with real-world scenarios, many errors arose that could compromise patient safety. Given that proprietary AI algorithms frequently evolve, ongoing safety assessments are essential.
Mixed-method evaluation including survey feedback and usability scales
Integrated AI-generated SmartSections in EHR workflows
Patients perceived less clinician distraction with AI scribing, but no improvement in perceived engagement
Fine-tuned models significantly outperformed zero-shot models; BART-Large-CNN was best for ED consultations
The study explores the impact of a digital scribe system on the clinical documentation process, demonstrating the use of the system in reducing summarization time while maintaining summary quality through collaborative editing. This study highlights the potential of digital scribe systems to address the challenges of clinical documentation.
Omission type errors, whereby the tool leaves out key information from
its response, were the most common, which poses safety risks; clinicians may struggle to identify omission errors due to reliance on memory recall.
Standardised evaluation frameworks and real-world testing required to
mitigate AI-related safety concerns.
AI-assisted documentation reduced after-hours work by 30% and increased same-day appointment closure by 9.3%.
AI scribing led to significant time savings but had variability across users.
AI Transcription Proficiency (paper-specific outcomes)
Patient satisfaction and engagement NR No significant difference in Patient-Doctor Relationship Questionnaire-9 (PDRQ-9) scores
BART-Large-CNN had
highest performance: ROUGE-1 F1 = 0.49,
ROUGE-2 F1 = 0.23,
ROUGE-L F1 = 0.35
Modified Physician Documentation Quality Instrument (PDQI-9), overall (IQR):
Manual: 31 (27–33)*
AS edited: 29 (26–33)*
AS: 25 (22–28)*
P value: <0.001
Errors by ADS (mean) 44 5 5 9.5
and clinician burden NR 20.4% less time
spent in notes per appointment
Time spent on notes, documentation burden NR Median daily documentation reduced by 6.89 min
Metric (F1 score,
Precision, Recall, WER)
ROUGE-1, ROUGE-2, ROUGE-L
Recall-Oriented Understudy for Gisting Evaluation-1 F1 score in % (IQR):
47.3 (42.5–56.4)
40.6 (35.0–45.4)
32.3 (27.0–37.4)
P value: <0.001
WER (SD): 2.9 (2.7)
Summarization accuracy and recall
Automatic summaries (edited by humans)
Automatic summaries
Omission, addition, wrong output, misplaced/irrelevant text
Comparator Type Subcategories Performance Zero-shot versus Fine-tuned models
NR With versus without DAX
Blinded comparison Manual
Comparison of ADS-generated notes with expert-reviewed transcripts
Pre-post intervention Time efficiency Baseline versus AI intervention
Standard
Nurse summary
notes from EMR
Highest scoring manual summary
Expert-reviewed
transcripts from real patient
encounters
Time in notes
per appointment
EHR usage time metrics
Table 2 (continued)
Study Reference
Owens et al., 2024 [29]
Sezgin et al., 2024 [30]
van Buchem et al., 2024 [31]
Biro et al., 2025 [32]
Duggan et al., 2025 [33]
Ma et al., 2025 [34]