Large language models as information providers for appropriate antimicrobial use: computational text analysis and expert-rated comparison of ChatGPT, Claude and Gemini

Discussion

The findings of the present study showed a general positive performance gradient from ChatGPT and Claude to Gemini. ChatGPT 3.0 performed the lowest, in line with expectations, considering it is the oldest technology in the current fast-paced development environment.35 In detail, regarding CTA, G4E was the only model to produce significantly longer paragraphs than the baseline G3E. Gemini outperformed the other models in lexical diversity, readability and positive sentiment. Claude 2.0 performed worst in terms of lexical diversity and showed unbalanced responses (very high or very low) across prompts in terms of sentiment. The readability differences and lexical diversity analysis indicate that Gemini-generated outputs are more suitable and accessible for the general population’s health information needs. An important consideration is that no LLM produced ‘easy’ or ‘standard’ level reading material, so the minimal education level needed to fully access this material corresponds to the schooling level of a 15-year-old (eg, high school), which can be an accessibility obstacle for some groups in the general population seeking healthcare advice. The study methodology ensured these findings were not primarily influenced by the difficulty of the prompts developed by expert researchers. GME exhibited a notable capacity for self-inhibition after considering the context and topic of the prompt. In the context of the literature, studies have consistently shown that readability remains a major challenge, with mixed findings for GPT-4 and Gemini, along with other models such as Bard or Grok.36–41
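The readability and lexical diversity metrics discussed above can be approximated from raw text with standard formulas. The following is a minimal sketch only: the study’s actual CTA pipeline is not detailed here, and the crude vowel-group syllable counter and the Flesch formulas are illustrative assumptions rather than the authors’ tooling.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels (min 1).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability_and_diversity(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)   # mean words per sentence
    spw = syllables / len(words)        # mean syllables per word
    return {
        # Flesch Reading Ease: higher = easier; ~60-70 is 'standard'
        "flesch_ease": 206.835 - 1.015 * wps - 84.6 * spw,
        # Flesch-Kincaid grade: approximate US school grade required
        "fk_grade": 0.39 * wps + 11.8 * spw - 15.59,
        # Type-token ratio: unique words / total words (lexical diversity)
        "ttr": len({w.lower() for w in words}) / len(words),
    }
```

On the Flesch Reading Ease scale, scores of roughly 60–70 correspond to ‘standard’ material, the level no LLM in the study reached.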

The quantitative scoring analysis results are in line with the CTA findings. ChatGPT-4 achieved the highest overall information quality scores. In the literature, GPT-4 demonstrated the highest overall accuracy, with models such as Gemini Advanced and ChatGPT-4 significantly outperforming ChatGPT-3.5 in specific specialties. While Claude 3 performed strongly in some areas, it often lacked citation support, and ChatGPT models, despite their accuracy, sometimes showed less reproducibility.42–48 The general tendency of LLMs to produce generally positive sentiment is noted, probably due to the ‘alignment’ process towards ethical values and non-discriminatory content applied in the training phase.49 In our study, we found that GME demonstrated the highest sentiment score, while CDE exhibited both highly positive-scored and highly negative-scored texts. These technologies are designed to avoid generating extreme emotional responses, especially negative ones, an advantageous feature for medical applications where both a mix of emotional tones and factual, unbiased communication styles are needed.50 In one study, Gemini exhibited predominantly positive sentiment compared with ChatGPT.51 AI-generated essays contained more language related to affect, authenticity and analytical thinking than student-written essays.52 53
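Sentiment scores like those compared above are often computed with lexicon-based methods that average word-level polarity. A minimal sketch follows, assuming a tiny illustrative word list; this is not the study’s actual sentiment tool.

```python
# Minimal lexicon-based sentiment scorer; the word sets below are
# illustrative assumptions, not the study's actual lexicon.
POSITIVE = {"safe", "effective", "helpful", "recommended", "improve"}
NEGATIVE = {"risk", "harmful", "resistance", "dangerous", "avoid"}

def sentiment_score(text: str) -> float:
    """Return polarity in [-1, 1]: +1 all-positive, -1 all-negative."""
    tokens = [t.strip(".,;:!?") for t in text.lower().split()]
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total
```

A score near 0 can mean either balanced polarity or no sentiment-bearing words at all, which is why distributions of scores across prompts (as analysed for CDE) are more informative than a single mean.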

Regarding national language analysis, the results highlight how LLMs prompted in English generally outperform their Italian-prompted counterparts. Our findings align with prior research demonstrating significant variations in ChatGPT-4’s performance across different languages, particularly favouring English over less-resourced languages, owing to stronger data availability and more extensive fine-tuning.54–56

Regarding persuasiveness, ChatGPT-4-generated messages were reported as more persuasive than human-generated messages on some influencing factors, such as untoward effects and stigmatised perceptions regarding human papillomavirus vaccination.57 Moreover, ChatGPT performed significantly better than the general population on the Levels of Emotional Awareness Scale, indicating considerable potential for health behaviour modification.53

Finally, with specific regard to AMR impact and infectious disease management, similarly low scores were observed for all LLMs. This is in line with literature findings. ChatGPT-3.5 produced generally correct and safe responses, although often incomplete, for questions regarding infectious disease pharmacotherapy.22 Montiel-Romero et al evaluated ChatGPT’s reliability in antibiotic prescription decisions by comparing its recommendations with those of infectious disease specialists, finding only moderate agreement (51%) in antibiotic choices and fair agreement (42%) in identifying resistance mechanisms.18 On the other hand, ChatGPT has been tested for assisting with documentation, patient communication and medical education,19 and its use for creating culturally and linguistically tailored AMR awareness messages has been explored,20 with mixed findings. Concerns about data privacy, security and hallucinations in AI-generated responses necessitate human oversight. Response quality also varied significantly across languages. Giacobbe et al reported that the informational quality of generated responses is suitable for the general public but not necessarily useful to professionals.21

Careful consideration should be given to the public health implications, both benefits and risks, of using these tools across general and professional populations. Our findings suggest that LLMs provide medical information of reasonable quality. However, their uncritical use should be avoided, as they are not a substitute for clinical judgement, especially in the complex field of infectious disease and antibiotic management. Appropriate technical use and human oversight are essential to mitigate risks such as misinformation from internet sources58 and to ensure that LLM outputs, which vary in readability and reliability across models and languages, are effectively understood by the general population. As no single LLM consistently excels across all medical domains, collaborative human-AI frameworks must be adopted.

Strengths and limitations

A limitation of this study is that the evaluated LLMs correspond to versions that precede those currently available. However, these models share the same general structure and are closely aligned with the latest versions. Moreover, technological advancement in this field progresses more rapidly than rigorous validation studies with sound methodology. Another limitation is that reproducibility across prompts was assessed empirically, without controlling hyperparameters such as temperature or seed values to standardise outputs. However, this can also be seen as a strength, as the chosen LLM interaction approach closely mirrors real-world user interactions. For instance, Gemini’s self-inhibition mechanism would not have been captured through remote application programming interface interaction and was observable only through simulated live interactions. A further limitation is the restriction of CTA to English, which, though methodologically appropriate, may limit insight into Italian-language LLM performance.

A key point concerns Gemini providing only five outputs for evaluation. While this self-inhibition demonstrates a valuable aspect of responsible information dissemination, it also poses a major limitation, as the small sample size may limit the generalisability of the findings for this LLM. Nonetheless, Gemini was evaluated through a comprehensive, multidimensional approach, and the results from these five outputs yielded statistically significant differences, indicating that the sample was still adequate to detect relevant contrasts. Furthermore, as one of the three major LLMs, Gemini’s inclusion is essential for a complete comparison with ChatGPT and Claude, and its self-interruption after five scenarios is itself a finding that warrants sharing with the scientific community. A strength of the study is having tested LLMs with prompts in both Italian and English. Additionally, a large number of scores were collected, which contributed to the statistical significance of the observed differences. Agreement among raters, while only slightly above the threshold for substantial acceptability, indicates a reasonable level of consistency in evaluation. The observed inter-rater agreement, though acceptable, may reflect subjective differences in perception or challenges in assessing certain aspects, such as AMR impact or persuasiveness. To account for this and ensure reliable odds ratios, the study deployed a mixed-effect ordered logistic model with random effects for the ‘scores’ variable. Finally, our study also, for the first time, considers the paediatric population, proposing clinical-diagnostic questions related to antibiotic therapies in children. This is of particular interest given the possible influence of digital technologies and AI on parents’ diagnostic-therapeutic decisions.
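The rater-agreement threshold mentioned above typically refers to a kappa statistic, where values above roughly 0.61 are conventionally read as ‘substantial’ agreement on the Landis-Koch scale. As an illustration only (the paper’s exact agreement statistic is not restated here), a minimal two-rater Cohen’s kappa can be computed as follows:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items with identical ratings.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    # Chance agreement: sum over categories of marginal proportions.
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)
```

Kappa corrects raw percentage agreement for the agreement expected by chance, which is why a raw 51% match (as in the antibiotic-choice comparison cited earlier) can correspond to only moderate agreement.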
