xRead - Incorporating Artificial Intelligence into Clinical Practice (March 2026)

TABLE 1 | Number of studies by country.

Country          Number of studies
USA              71
China            16
Italy            12
Germany          9
France           7
Turkey           7
UK               6
Australia        4
Belgium          4
Canada           4
Poland           4
Denmark          3
South Korea      3
India            2
Japan            2
Spain            2
Brazil           1
Finland          1
Israel           1
Jordan           1
Lebanon          1
Netherlands      1
Portugal         1
Saudi Arabia     1
Taiwan           1
Thailand         1

3.3.3 | Triaging and Patient Classification

Three studies used NLP for patient triaging [37, 52, 61], finding accuracies of 53.8% and 76.7% and an F1 score of 0.76 for various categorization tasks, whereas two used NLP to classify patients' symptoms to the appropriate medical department, with accuracies of 75.5% and 93.5% [60, 131].

3.3.4 | Clinician Education and Decision Support

Of sixty studies evaluating NLP responses to clinical questions, 57 utilized ChatGPT [16, 23, 25, 27, 28, 32, 33, 36, 42, 45–47, 49, 50, 52, 53, 65–68, 72–75, 78–82, 84, 85, 90, 92–94, 102, 103, 108, 110, 111, 114, 116, 122, 127–129, 139–141, 148, 149, 152, 154, 158, 173, 176, 177]. Forty studies reported accuracies ranging from 5.1% to 100% (median 73.5%) [16, 23, 25, 27, 28, 32, 36, 45, 49, 50, 52, 53, 65–67, 72, 78–82, 84, 85, 92, 94, 102, 103, 110, 116, 125, 128, 129, 139–141, 152, 154, 158, 171, 177], whereas 16 studies utilized various scales, predominantly Likert [42, 46, 47, 68, 74, 75, 90, 93, 111, 114, 122, 127, 148, 149, 173, 176].

ChatGPT's accuracies for treatment recommendations were 25%–85.3% among 16 studies [25, 45, 47, 65, 67, 72, 75, 78, 79, 93, 103, 110, 111, 139, 148, 177]. ChatGPT-4 outperformed Gemini Advanced and Llama 2 in recommending treatments and was comparable to Claude 3 [47, 103, 148]. Of 23 studies evaluating diagnostic capabilities, 17 reported accuracies of 5.1%–100%, with 15 achieving 62.5%–100% [16, 27, 32, 33, 36, 46, 47, 52, 57, 72, 78, 79, 84, 92, 93, 116, 125, 127, 139, 152, 158, 171, 177]. In diagnostic workup comparisons, ChatGPT outperformed Gemini (Bard) in two studies and matched its performance in another two, but was outperformed by Claude 3 in a study by Schmidl et al. [36, 52, 84, 158].

Twenty-seven studies evaluated responses to general questions, with accuracies from 37.3% to 87.2% [23, 27, 28, 42, 49, 50, 53, 66, 68, 74, 80–82, 85, 90, 94, 102, 108, 122, 128, 129, 140, 141, 149, 154, 173, 176]. Of these, 12 studies evaluated ChatGPT's responses to board-exam questions, finding accuracies of 47.3%–86% for ChatGPT-4 and 45.2%–57% for older models [23, 28, 49, 50, 66, 74, 81, 85, 140, 141, 154, 160]. Buhr 2024 compared ChatGPT-4 to Claude 2 and Bard (Gemini), finding that ChatGPT-4 outperformed its counterparts by 7.3% and 26.9%, respectively. Long et al. reported that ChatGPT-4 achieved 75% accuracy on Royal College of Physicians and Surgeons of Canada (RCPSC) otolaryngology questions and later developed ChatENT, a derivative trained on OHNS references, which outperformed ChatGPT-4 with 80% on American Board and 87% on RCPSC sample questions [66].

Four studies utilized NLP for trainee selection [91, 118, 132, 134]. Two studies used NLP to create clusters of residency and fellowship applicants based on personal statements and recommendation letters, which could predict success in the match [91, 134]. Halagur et al. demonstrated demographic biases of ChatGPT in residency selection, while Farlow et al. did not find gender bias in ChatGPT-generated reference letters.

TABLE 2 | Number of studies investigating the respective subspecialty.

Subspecialty                  Number of studies
Not specified/OHNS overall    81
Head and neck cancer          54
Rhinology                     15
Neurotology                   9
Pediatrics                    5
Laryngology                   5
Facial plastics               3

3.4 | Research Uses

3.4.1 | Data Extraction and Analysis for Research

Fourteen studies used NLP to extract information from clinical notes [38, 55, 86, 87, 115, 124, 136, 142, 145, 147, 155, 156, 170, 175]. Two of eleven studies reported accuracies at 83%–84% [136, 156], while
