xRead - Incorporating Artificial Intelligence into Clinical Practice (March 2026)

FIGURE 2 | Histogram of the year of publication distribution (binwidth = 1 year).

Accuracy of responses to patient questions was evaluated in 34 studies: 13 studies reported accuracy ranging from 69.2% to 98.3% [31, 40, 62–64, 94, 98, 100, 101, 123, 128, 161, 174], while 21 studies used scales [34, 39, 42, 44, 70, 76, 77, 99, 104, 106, 111, 113, 122, 146, 157, 159, 165, 168, 169, 176, 178]. Four studies utilized the DISCERN instrument to characterize the quality of answers, with scores ranging from 35.83 to 50.7 [44, 99, 104, 159]. Three studies assessed the safety of LLM responses, finding that 0%–5% of answers were potentially harmful [34, 168, 174]. Nineteen studies investigated the readability of LLM responses [31, 44, 51, 76, 95, 97–100, 104, 106, 113, 123, 126, 146, 159, 160, 165, 169]. Of these, 16 studies [31, 44, 95, 97, 99, 100, 104, 106, 113, 123, 126, 146, 159, 160, 169] graded readability of LLM outputs using the Flesch–Kincaid Reading Ease Scale (FRES) and 15 studies [31, 44, 76, 95, 97–100, 104, 106, 126, 146, 159, 160, 169] used the Flesch–Kincaid Grade Level (FKGL), with scores ranging from 25 to 77.27 and 5.3 to 14.2, respectively. Five studies utilized the Simple Measure of Gobbledygook score to evaluate readability, with scores ranging from 9.9 to 30.9 [95, 99, 104, 123, 159]. Eight studies prompted ChatGPT to give responses with increased readability, finding improvements of 13.3–30.5 in FRES and 1.5–5.2 in FKGL scores [51, 76, 95, 98, 113, 126, 159, 160]. Three of these studies [95, 126, 160] assessed ChatGPT's ability to adapt patient materials to Grade 5–6 reading levels, finding FKGL score reductions of 3.2–5.23 points, resulting in final FKGL scores between 5.3 and 7.6, meeting their goal in two studies [126, 160].
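For orientation, the readability measures named above can be sketched from their standard published formulas. This is a minimal illustration only: the included studies may have relied on different software or syllable-counting rules, and the function names and the word, sentence, and syllable counts below are hypothetical. A higher FRES indicates easier text, whereas a lower FKGL or SMOG value indicates a lower reading grade, which is why improvement appears as an increase in FRES and a reduction in FKGL.

```python
import math


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease (standard formula); higher means easier to read."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level (standard formula); approximates a US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def smog_grade(polysyllables: int, sentences: int) -> float:
    """SMOG grade (standard formula); polysyllables are words with 3+ syllables."""
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291


if __name__ == "__main__":
    # Hypothetical counts for a short patient handout (not taken from any included study).
    words, sentences, syllables, polysyllables = 100, 5, 150, 12
    print("FRES:", flesch_reading_ease(words, sentences, syllables))
    print("FKGL:", flesch_kincaid_grade(words, sentences, syllables))
    print("SMOG:", smog_grade(polysyllables, sentences))
```

The functions take pre-computed counts rather than raw text because syllable counting is itself heuristic and differs between tools, which may partly explain variation in reported scores.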

Three studies compared LLMs, finding that ChatGPT outperformed Gemini (Bard) and Copilot (Bing) in accuracy and quality of responses, but Copilot had the best readability [97, 146, 168]. Bard was found to be less safe than ChatGPT [168].

Four studies applied NLP to outpatient monitoring [21, 56, 162, 163]. Ma et al. 2021 showed that automated health chats for H&N cancer patients undergoing radiotherapy captured some adverse effects more effectively than physicians. Two studies utilized NLP to analyze sentiment and emotions surrounding treatment for patients with H&N cancers [162, 163].

3.3.2 | Electronic Medical Record Improvement

There were 14 studies where NLP was used in applications that could improve electronic medical records (EMR) and documentation [48, 58, 83, 87, 96, 105, 117, 121, 135–137, 150, 151, 167]. Four studies evaluated NLP models for documentation tasks [83, 96, 121, 150]. Two studies automatically generated operative notes from intraoperatively spoken keywords, demonstrating time savings but mixed accuracy [83, 121]. Four studies implemented NLP for EMR information retrieval tasks such as retrieving diagnoses, past treatments, and post-operative complications, demonstrating mixed results for accuracy [48, 87, 117, 137]. Another four studies evaluated accuracy for classifying diagnostic imaging findings, ranging from 83% to 96.5% [58, 135, 136, 167].

The Laryngoscope, 2025
