JMIR Medical Informatics

Propagation of Interpreter Errors by Ambient AI Scribes: Study Using Simulated Clinical Encounters

2026-07-28T16:45:10-04:00

In simulated English and Spanish clinical encounters, ambient AI scribes propagated interpreter errors into clinical notes, with patterns varying by speaker role and error type. These findings highlight the need for further evaluation of AI-scribe performance in multilingual and interpreter-mediated clinical care.

Diagnostic Code Ambiguity and Misclassification of Adults With Spinal Muscular Atrophy: Single-Center Chart Review

2026-07-28T15:00:05-04:00

In a single US health system, fewer than one in three adults flagged by spinal muscular atrophy–associated diagnostic codes had molecularly confirmed spinal muscular atrophy (positive predictive value 27%), with the majority miscoded across clinically distinct categories.

Mapping the Reliability-Readability Gap in the Education of Patients With Age-Related Macular Degeneration Across 6 Large Language Models: Comparative Evaluation Study

2026-07-28T13:15:08-04:00

Background: Artificial intelligence–generated health information is increasingly used by patients, but its reliability, visible transparency indicators, and readability remain uncertain in specialized ophthalmic conditions such as age-related macular degeneration (AMD). Objective: This study aimed to evaluate and compare the informational reliability, visible transparency indicators, overall quality, and readability of responses generated by 6 publicly accessible large language models (LLMs) to AMD-related patient-facing prompts under a zero-shot, single-turn prompting scenario. Methods: Thirty English-language AMD-related prompts were curated from Google Trends, the 2023 Chinese AMD guideline, and the 2025 American Academy of Ophthalmology Preferred Practice Pattern. Chinese guideline–derived prompts were translated and reviewed before model querying. Each finalized prompt was entered verbatim into ChatGPT-5.1-auto, DeepSeek-v3.2, Gemini-2.5-Flash-Thinking, Grok 4, Claude-Sonnet 4.5, and Qwen3-Max between October 10 and November 25, 2025. Two senior ophthalmologists (ZL and XM) blinded to model identity independently scored all responses using DISCERN, Ensuring Quality Information for Patients (EQIP), Global Quality Scale, and Journal of the American Medical Association benchmark criteria, with adjudication for disagreements. Readability was assessed using 6 standard formulas against a sixth-grade benchmark. Between-model differences were analyzed using Friedman tests with Holm-adjusted pairwise comparisons. Results: A total of 180 responses were analyzed. Interrater agreement was substantial to near-perfect across reliability instruments (κ=0.72‐0.97). No model met the recommended sixth-grade readability target. 
Grok 4 achieved the highest scores on reliability-related instruments, including DISCERN (mean 46.40, SD 7.43) and EQIP (mean 74.33, SD 9.07), whereas DeepSeek-v3.2 generated the most readable responses, with the highest Flesch Reading Ease Score (mean 48.23, SD 9.16) and lowest Flesch-Kincaid Grade Level (mean 9.95, SD 1.87). Significant between-model differences were observed across all reliability and readability metrics (all P<.001). Conclusions: Under zero-shot, single-turn prompting conditions, the evaluated public LLMs showed substantial model-dependent differences in AMD-related patient education quality and readability. No model met the sixth-grade readability benchmark, including those with comparatively stronger reliability performance. These findings support clinician oversight, readability optimization, and further evaluation before LLM-generated AMD information is used directly in patient-facing settings.
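For reference, the two readability metrics named above follow fixed published formulas over word, sentence, and syllable counts. The sketch below is illustrative only: the function names and the example counts are hypothetical, and it assumes the counts are already available, since the abstract does not state which tool was used to compute them.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease Score: higher values indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for a single response (not data from the study).
words, sentences, syllables = 100, 5, 150
fres = flesch_reading_ease(words, sentences, syllables)
fkgl = flesch_kincaid_grade(words, sentences, syllables)

# Sixth-grade benchmark named in the abstract, applied to the grade-level metric.
meets_target = fkgl <= 6.0
print(round(fres, 2), round(fkgl, 2), meets_target)
```

Longer sentences and more syllables per word lower the Reading Ease score and raise the grade level, which is why the two metrics move in opposite directions.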

Logic Models on Health Information Technology–Related Interventions: Scoping Review Across Disciplines

2026-07-28T13:15:08-04:00

Background: Health information technology (HIT) interventions are complex, context-dependent, and often insufficiently theorized, which can hinder their design, implementation, and evaluation. Program theory approaches such as logic models and theory of change (ToC) are well-established in public health and implementation science for articulating causal assumptions. Their use in medical informatics, however, appears inconsistent. A systematic overview of how logic models and ToC have been applied to HIT interventions is therefore needed to support theory-informed development and cumulative learning in the field. Objective: We aimed to map how logic models, ToC, and related program theory approaches have been conceptualized, constructed, and applied in HIT-related interventions across disciplines. A secondary objective was to identify implications for medical informatics research and practice. Methods: Following PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews), searches were conducted in PubMed, Web of Science, Academic Search Elite, APA PsycArticles, and CINAHL. Eligible publications used a logic model, ToC, or related construct within an HIT intervention in any health care or social-care setting. Next, 2 reviewers independently screened records and extracted data on study characteristics, type and purpose of HIT, model structure, theoretical foundations, and reported benefits and challenges. Results: A total of 69 publications (2012‐2025) met the criteria. Use of program theory increased markedly after 2020 and spanned medical informatics, public health, health services research, and implementation science. Logic models were most frequently applied to patient-facing and self-management technologies, particularly mobile health, telehealth, and home-based remote monitoring. Most models were used to support HIT development or evaluation. 
Of the total, 60 (87%) studies provided a logic model visualization, although structures varied considerably. Out of 69, 50 (72%) studies cited guidelines for model development, most commonly UK Medical Research Council guidance, realist evaluation, or the Kellogg Logic Model. Out of 69 studies, 28 (41%) used behavioral or implementation frameworks, such as Capability, Opportunity, Motivation–Behavior model (COM-B), Consolidated Framework for Implementation Research (CFIR), Expert Recommendations for Implementing Change (ERIC), Fit between Individuals, Task, and Technology (FITT), or Non-adoption, Abandonment, and challenges to the Scale-up, Spread, and Sustainability (NASSS) to populate model content. Only 3 (4%) studies reused an existing model. Reported benefits concerned improved theorization, structured evaluation, and stakeholder engagement; challenges included limited empirical evidence, high resource demands, and tensions between specificity and generalizability. Conclusions: Program theory approaches are increasingly used to conceptualize and evaluate HIT interventions; yet, their application in medical informatics remains fragmented. More systematic and theory-informed use of logic models could enhance conceptual clarity, methodological rigor, and cumulative learning. Future work should promote model reuse, establish repositories, strengthen reporting standards, and integrate program theory in HIT education and research to support coherent development, evaluation, and scaling of digital health interventions.

Dynamic Closed-Loop Medical Record Quality Management Using an AI-Driven Multilevel Quality Control System: Development and Implementation Study

2026-07-27T07:00:03-04:00

Background: As the core documentation of the clinical diagnosis and treatment process, the quality of medical records is directly related to medical safety and efficiency. However, traditional manual quality control (QC) has limitations such as limited coverage, low efficiency, and inconsistent standards, making it difficult to meet the high standards of modern hospitals. Objective: The purpose of this study was to develop a medical record QC platform based on AI technology and to build a multilevel QC system relying on the platform to comprehensively improve the form and content of electronic medical records (EMRs). Methods: This study established a QC rule database based on relevant medical record writing standards, combined deep learning and natural language processing technologies to build an AI-driven QC engine, designed a 4-level QC and management process, and implemented the system in a comprehensive tertiary hospital in China. Results: After applying AI for the first level of QC, the coverage rate of automated initial screening of EMRs reached 100%. On this basis, combined with manual participation in the second to fourth levels of QC, a new model of medical record quality management was formed, featuring human-machine collaboration, interaction between QC units and clinical departments, efficient closed-loop operation, and equal emphasis on form and content. The completeness, timeliness, consistency, and accuracy of EMRs significantly improved, while various indicators, such as the grade A rate and excellent rate, continued to increase (all P<.001). Conclusions: This study demonstrated that the implementation of an AI-based 4-level QC management model helped overcome the limitations of traditional manual QC methods to a certain extent, facilitated comprehensive, dynamic, and closed-loop management of EMRs, and effectively improved the overall quality of medical records.
This model can provide a reference for other hospitals seeking to reform their QC models.

Persona-Driven Data Augmentation for Disease Name Recognition Across Rare and General Disease Corpora: Comparative Evaluation Study

2026-07-24T16:30:21-04:00

Background: Medical information extraction requires automatically identifying disease names and related terms in text. This task, known as named entity recognition (NER), relies on expert-annotated data that are costly to produce and often available only in limited quantities. Data augmentation (DA) aims to expand available training data; however, standard techniques such as synonym replacement and back-translation may introduce inappropriate substitutions or fail to preserve entity-label alignment, which is critical for sequence-labeling tasks. Although large language models can generate fluent text, their outputs may also contain factual inconsistencies or unintended changes if not carefully controlled. Objective: This study investigated whether persona-driven, document-level DA using a large language model could improve biomedical disease NER performance by generating diverse rephrasings of medical documents while preserving annotated entities. Methods: We designed a DA framework using multiple personas that varied in medical expertise, personality, tone, and narrative style. Using prompting constrained by XML tags, each persona rephrased training documents while aiming to preserve annotated entity spans. We evaluated the framework on 2 biomedical disease NER datasets with complementary roles: RareDis, a low-resource rare disease corpus, and National Center for Biotechnology Information (NCBI) disease, a more general disease benchmark. Semantic fidelity and lexical diversity were measured using BERTScore and Bilingual Evaluation Understudy (BLEU-4), respectively, and personas were grouped into high-, balanced-, and low-fidelity subsets. Biomedical pretrained BioBERT models were fine-tuned and evaluated under multiple settings, including gold-standard (GS) data only, synonym replacement, single-persona augmentation, curated persona subsets, and all-persona augmentation. 
Performance was assessed using microaveraged entity-level precision, recall, and F₁-score, and results were examined at both the overall and individual entity-type levels. Performance values are reported as mean (SD). Results: Persona-driven DA improved NER performance over GS-only training in both datasets, with the strongest gains obtained by combining multiple persona-generated variants with GS data. In RareDis, the best result was achieved by the low-fidelity subset (mean F₁-score 73.35, SD 0.19 vs baseline 71.22, SD 0.45), while in NCBI disease, the all-personas setting performed best (mean F₁-score 89.32, SD 0.26 vs baseline 87.82, SD 0.18). In low-resource experiments, the all-personas and high-fidelity persona settings in NCBI disease exceeded the performance of the model trained on 100% GS data using only 60% of the training data, whereas gains in RareDis were more modest. Entity-level analysis showed improvements across RareDis categories, particularly for symptom, and confusion analysis indicated reduced symptom-sign confusion under augmentation. Conclusions: Persona-driven DA improved biomedical disease NER by introducing controlled linguistic variation while largely preserving annotated entities. The strongest gains were obtained when multiple persona-generated variants were combined with GS data, although the benefit varied across datasets. These findings suggest that this approach is a promising strategy for low-resource biomedical NER.
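As a rough illustration of the evaluation metric named above, microaveraged entity-level scores pool true positives, false positives, and false negatives over all entity types before computing precision, recall, and F₁. The sketch below assumes exact span-and-type matching, which the abstract does not specify; the function name and the example annotations are hypothetical.

```python
def micro_prf(gold, predicted):
    """Microaveraged entity-level precision, recall, and F1.

    gold and predicted are sets of (doc_id, start, end, entity_type) tuples.
    An entity counts as correct only on an exact span-and-type match
    (an assumption; the abstract does not state the matching rule).
    """
    tp = len(gold & predicted)  # pooled over every document and entity type
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical annotations (not data from the study).
gold = {
    ("doc1", 0, 8, "DISEASE"),
    ("doc1", 20, 27, "SYMPTOM"),
    ("doc2", 5, 12, "SIGN"),
    ("doc2", 30, 41, "DISEASE"),
}
predicted = {
    ("doc1", 0, 8, "DISEASE"),
    ("doc1", 20, 27, "SIGN"),     # correct span, wrong type: counted as an error
    ("doc2", 5, 12, "SIGN"),
    ("doc2", 30, 41, "DISEASE"),
    ("doc2", 50, 55, "SYMPTOM"),  # spurious prediction
}
precision, recall, f1 = micro_prf(gold, predicted)
print(precision, recall, round(f1, 4))
```

Under this matching rule, a symptom labeled as a sign costs both a false positive and a false negative, so reduced symptom-sign confusion would raise precision and recall together.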

Large Language Models for Endodontic Symptom Assessment and Treatment Planning Using Image-Free Clinical Records: Comparative Evaluation Study

2026-07-24T16:30:15-04:00

Background: Accurate assessment of pulpal status is essential for achieving successful endodontic outcomes. However, direct evaluation remains inherently challenging because the pulp is surrounded by calcified tissue, necessitating reliance on clinical and radiographic examinations for diagnostic and prognostic decision-making. These procedures demand substantial clinical expertise and time, and less-experienced clinicians often face challenges that may lead to errors in diagnosis and treatment planning. Recent advancements in large language models (LLMs) offer promising opportunities to enhance clinical reasoning by facilitating the integration of evidence and supporting methodical diagnostic decision-making. Objective: This study aimed to evaluate the clinical applicability of LLMs by comparing their text-based clinical screening performance and the clinical validity of their treatment plan responses with those of human evaluators. Methods: Between January 2011 and December 2022, 100 clinical cases involving primary endodontic disease were randomly selected from the clinical records of outpatients who visited the Department of Conservative Dentistry or Advanced General Dentistry (AGD) at Yonsei University Dental Hospital. Four prompt types, combining 2 variables (language and role), were used as input for 4 LLMs. Both LLMs and human evaluators (AGD specialists, AGD residents, endodontic residents, and senior dental students) assessed the cases using text-based clinical records. Radiographic images were not directly provided. Screening performance was evaluated using a 0-to-2-point concordance scale, and treatment plan validity and relevance were assessed using a 5-point Likert scale. Results: Among the 4 LLMs evaluated, ChatGPT achieved the highest mean concordance score on Korean-doctor prompts (mean 0.98, SD 0.82). However, this score did not reach the partially correct criterion of 1 on the 0 to 2-point scale. 
Clova X recorded the lowest mean score on English-patient prompts (mean 0.23, SD 0.63). Across both diagnostic categories, AGD specialists demonstrated the highest diagnostic accuracy (pulpal: 0.70; periapical: 0.65), with higher sensitivity but lower specificity than those exhibited by the other groups. ChatGPT also showed favorable performance among the LLMs, with accuracies of 0.65 (95% CI 0.55‐0.74) for pulpal disease and 0.57 (95% CI 0.47‐0.69) for periapical disease, which were comparable to those of AGD and endodontic residents. Conclusions: Under image-free clinical record review conditions, ChatGPT 4.0 showed relatively higher and more consistent performance in symptom screening and treatment planning compared to the other LLMs evaluated. However, its highest mean score of 0.98 (SD 0.82) did not reach the partially correct criterion of 1 on the 0-to-2-point scale. Hallucinations generated by LLMs and experience-dependent interpretation biases among human evaluators remain key challenges that require attention. Therefore, continuous clinical supervision and comprehensive user training are necessary for safe and effective clinical application.

R-R Interval Histogram-Based Deep Learning for 3-Class Atrial Fibrillation Screening in Garment-Type Wearable Holter Electrocardiogram Monitoring: Algorithm Development and Validation Study

2026-07-24T15:30:15-04:00

Background: Long-term garment-type wearable Holter electrocardiographic (ECG) monitoring is frequently affected by noise contamination, which complicates automated atrial fibrillation (AF) detection in real-world recordings. Although deep learning has shown high performance for AF detection, relatively few studies have evaluated explicit strategies for handling noise-included wearable ECG data. An alternative representation using the R-R interval (RRI) time series may reduce the dependence on waveform morphology and provide an alternative pathway for AF screening in noisy recordings. Objective: This study aimed to develop and evaluate a 3-class, noise-aware RRI-based AF screening framework that explicitly separated AF, non-AF, and uninterpretable noise windows, and to assess the impact of analysis window length on model performance. Methods: Single-lead garment-type wearable Holter ECG data from 117 patients at the University of Osaka Hospital were analyzed after exclusion of patients with documented atrial tachycardia, flutter, or paced rhythm according to the predefined task definition. R-peaks were automatically detected, and the resulting RRI segments were converted into 2D histogram images, with time on the x-axis and RRI-derived heart rate on the y-axis, for 1.5-, 3-, and 6-minute windows. A ResNet-34–based 2D convolutional neural network was trained for 3-class classification. Model performance was evaluated using 5-fold interpatient cross-validation on the institutional dataset and independent external testing on the MIT-BIH (Massachusetts Institute of Technology–Beth Israel Hospital) AF Database (AFDB). In the external validation, atrial flutter–annotated intervals were excluded to match the training task definition. Patient-level AF burden was evaluated by comparing reference AF burden with model-estimated AF burden using Pearson and Spearman correlation coefficients, and linear regression. 
Results: Of 129 monitored patients between March 1, 2023, and November 20, 2025, 117 were analyzed. In the internal validation, the 3-class model (non-AF, AF, and noise) showed similarly high performance for the 1.5- and 3-minute windows, both with an accuracy of 96.6%. In independent external validation, the 3-minute window showed numerically the highest overall performance (accuracy: 97.3%; AF sensitivity: 96.9%; and AF specificity: 97.7%), although the differences across window lengths were modest. At the patient level, AF burden correlation was high across all window lengths, with Pearson r of 0.995, 0.991, and 0.989 and Spearman ρ of 0.988, 0.982, and 0.979 for the 1.5-, 3-, and 6-minute models, respectively. Conclusions: The RRI-based 2D convolutional neural network achieved high AF classification accuracy and strong patient-level correlation with reference AF burden. Using RRI features and a 3-class framework, which explicitly separated noise from AF and non-AF rhythms, a 3-minute RRI window provided a favorable balance of performance for AF screening in a garment-type Holter ECG.
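The input representation described in the Methods (time on the x-axis, RRI-derived heart rate on the y-axis) can be sketched as a simple 2D count grid. This is a minimal illustration, not the authors' implementation: the function name, bin counts, heart rate range, and the choice to time-stamp each beat at the end of its interval are all assumptions, since the abstract does not give these details.

```python
def rri_histogram(rri_ms, window_ms=180_000, time_bins=6,
                  hr_min=30.0, hr_max=210.0, hr_bins=6):
    """Bin one window of R-R intervals into an hr_bins x time_bins count grid.

    Rows index heart rate (row 0 = lowest) and columns index time.
    All bin settings are hypothetical defaults, not values from the study.
    """
    hr_bin_width = (hr_max - hr_min) / hr_bins
    grid = [[0] * time_bins for _ in range(hr_bins)]
    t_ms = 0
    for rri in rri_ms:
        if rri <= 0:
            continue  # skip invalid intervals
        t_ms += rri  # beat time = cumulative sum of intervals
        if t_ms >= window_ms:
            break  # beat falls outside the analysis window
        hr = 60_000.0 / rri  # instantaneous heart rate in beats per minute
        if not (hr_min <= hr < hr_max):
            continue  # outside the plotted heart rate range
        col = int(t_ms * time_bins // window_ms)
        row = int((hr - hr_min) / hr_bin_width)
        grid[row][col] += 1
    return grid


# Hypothetical regular rhythm: ten 800 ms intervals (75 beats/min), 8-second window.
grid = rri_histogram([800] * 10, window_ms=8_000, time_bins=2)
print(grid)
```

A regular rhythm concentrates counts in a single heart rate row, whereas irregular R-R intervals spread counts across many rows, which is the kind of pattern an image classifier can pick up without relying on waveform morphology.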

Clinically Interpretable Deep Learning for Differentiating Vitiligo and Postinflammatory Hypopigmentation: Diagnostic Accuracy Study

2026-07-24T15:30:14-04:00

Background: Distinguishing vitiligo from postinflammatory hypopigmentation (PIH) is clinically challenging because both conditions may present with similar depigmented lesions. Although deep learning has shown strong potential for dermatologic image classification, limited interpretability remains a barrier to clinical adoption. Objective: This study aimed to develop an interpretable deep learning framework for accurate differentiation between vitiligo and PIH using a lightweight convolutional neural network and an ensemble of explainability methods. Methods: A total of 332 clinical images (176 vitiligo and 156 PIH) were collected from King Abdullah University Hospital and publicly available online sources. Images were preprocessed and evaluated using patient-wise 5-fold cross-validation to eliminate patient-level data leakage. A pretrained MobileNetV2 model was fine-tuned by unfreezing the final 30 layers. To enhance interpretability, gradient-weighted class activation mapping (Grad-CAM), integrated gradients, and smooth gradients (SmoothGrad) were combined into an equal-weight ensemble explanation framework. Performance was assessed using accuracy, precision, recall, F₁-score, and the area under the receiver operating characteristic curve (AUC). Results: The proposed model achieved an overall accuracy of 94.88%, macroaveraged precision of 94.88%, recall of 94.84%, F₁-score of 94.86%, and an AUC of 0.9885 across the 5 validation folds. The ensemble framework produced clinically meaningful explanations in 98.48% of a representative 66-image validation subset used for interpretability analysis. Conclusions: The proposed framework combines high diagnostic performance with robust interpretability for distinguishing vitiligo from PIH. By integrating multiple complementary explanation methods, the approach enhances clinical transparency and may support dermatologists in the differential diagnosis of pigmentary disorders.
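The patient-wise cross-validation mentioned in the Methods can be sketched as follows. The idea is to assign whole patients, rather than individual images, to folds so that images from one patient never appear in both the training and validation sets. The function name, the seeded shuffle, and the example patient IDs are hypothetical; the abstract does not describe how folds were constructed.

```python
import random


def patient_wise_folds(patient_ids, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs in which no patient spans both sets.

    patient_ids[i] is the patient who contributed image i.
    """
    patients = sorted(set(patient_ids))
    random.Random(seed).shuffle(patients)  # seeded for reproducibility
    fold_of = {p: i % k for i, p in enumerate(patients)}  # round-robin assignment
    for fold in range(k):
        val_idx = [i for i, p in enumerate(patient_ids) if fold_of[p] == fold]
        train_idx = [i for i, p in enumerate(patient_ids) if fold_of[p] != fold]
        yield train_idx, val_idx


# Hypothetical data set: 10 patients with 2 images each.
patient_ids = ["p%02d" % (i // 2) for i in range(20)]
folds = list(patient_wise_folds(patient_ids, k=5))
for train_idx, val_idx in folds:
    train_patients = {patient_ids[i] for i in train_idx}
    val_patients = {patient_ids[i] for i in val_idx}
    print(len(train_idx), len(val_idx), train_patients.isdisjoint(val_patients))
```

Splitting at the image level instead would let near-duplicate photographs of the same lesion land on both sides of the split and inflate validation metrics.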

From Paper to Digital Medical Documentation in the Field: The Rapid Development and Deployment of the Digital Casualty Card System During a War

2026-07-24T15:00:03-04:00

Background: The accurate documentation of medical treatments for combat-injured personnel has historically posed significant challenges for prehospital medical teams. During recent US Army conflicts in Afghanistan and Iraq, only 18%-25% of casualties had any form of prehospital documentation. In the Israel Defense Forces (IDF), traditional manual and paper-based documentation has proven inefficient. During the 2014 Israel-Hamas conflict in Gaza, the completion rate for full documentation was notably low; only 11% (82/704) of casualties had casualty cards from the field. The sudden outbreak of the 2023-2024 Israel-Hamas war required an immediate re-evaluation of battlefield medical documentation practices. The IDF identified an urgent need for an innovative documentation approach to address the challenges of managing and tracking casualties in high-pressure scenarios. The integration of this system underscores the importance of real-time, robust documentation in improving continuity of care, minimizing medical error, and enhancing operational efficiency. Objective: This study outlines the rapid development and deployment of the Digital Casualty Card System (DCCS), designed to enhance the accuracy and efficiency of field documentation by medical teams during the 2023-2024 Israel-Hamas war. Methods: The DCCS was designed to streamline real-time medical data capture, enhance information transfer along the evacuation chain, and improve battlefield casualty care. A strategic decision was made to prioritize rapid deployment by focusing on a user-friendly, digital application, deliberately excluding advanced features such as sensor integration and real-time data transfer between echelons. This system became operational within 2 weeks of the project’s initiation and comprises military-grade tablets embedded with a dedicated software application for documenting casualty status and plastic memory cards worn around the casualty's neck for data transfer between medical teams. 
This study uses patient data from the IDF Trauma Registry, relying on data from point-of-injury casualty cards (DCCS), after-action reports, and data entry by on-scene and en route providers. Results: Overall, since the beginning of the distribution, over 700 DCCS kits were embedded in combat units, medical evacuation units, and training units. During the ground operation in Gaza, out of 2984 casualties, 1175 (39%) arrived with DCCS documentation. Conclusions: The rapid development and deployment of DCCS during the ongoing war proved to be feasible and contributed to substantial improvements in both documentation rates and the quality of data collected in the field compared to traditional paper-based casualty cards.