Abstract
Objective The study aimed to evaluate the top large language models (LLMs) in validated medical knowledge tests in Portuguese.
Methods This study compared 31 LLMs on the Brazilian national medical examination (Revalida), evaluating the performance of 23 open-source and 8 proprietary models across 399 multiple-choice questions.
Results Among the smaller models, Llama 3 8B exhibited the highest success rate, achieving 53.9%, while the medium-sized model Mixtral 8×7B attained a success rate of 63.7%. Larger models such as Llama 3 70B achieved a success rate of 77.5%. Among the proprietary models, GPT-4o and Claude Opus demonstrated superior accuracy, scoring 86.8% and 83.8%, respectively.
Conclusions 10 of the 31 LLMs surpassed human-level performance on the Revalida benchmark, while 9 failed to provide coherent answers to the task. Larger models exhibited superior performance overall; however, certain medium-sized LLMs surpassed the performance of some of the larger LLMs.
Introduction
The emergence of large language models (LLMs) has prompted discussions on their potential in the medical field. These advanced models demonstrate significant potential in areas such as disease management,1 decision-making2 and medical research.3 Despite their promising capabilities, existing research predominantly concentrates on datasets in Chinese4 and English,5 with limited attention given to multilingual models6 and less commonly spoken languages. This poses a substantial problem as over half of the global population, including around 293 million Portuguese speakers, is not represented in English-centric datasets, potentially leading to health inequities in the deployment of LLMs in medicine. Global inequities in medicine are widespread, particularly in countries where English is not the primary language. Despite international efforts to reduce health disparities, progress has been uneven and often hindered by the slow advancement towards universal health coverage.7 In this context, technology can play a crucial role in addressing these disparities.8 Therefore, deploying LLMs in healthcare could be a powerful tool to help mitigate the existing inequalities. Given the considerable variability of medical knowledge across diverse cultural contexts, particularly evident in language diversity, this study seeks to develop a benchmark, specifically in Portuguese, for assessing the medical knowledge of the top 31 LLMs in a non-English and non-Chinese scenario.
Results
Table 1 displays the performance of each LLM on our dataset. We evaluated each LLM by running all 399 questions five times to account for LLM randomness and report a 99% CI based on the SD across runs. The results exclude all Apollo, Meditron, Yi and Gemini 1.5 Pro models because of their lack of coherence in responses: these models produced outputs without meaningful connections to the questions asked. Among the open-source small models, Llama 3 8B achieved the highest success rate, 53.9%. Among the medium-sized models, Mixtral 8×7B achieved a success rate of 63.7%. Among the large-scale models, Llama 3 70B instruct demonstrated a success rate of 77.5%. Among the proprietary models, GPT-4o achieved a score of 86.8%, while Claude Opus achieved 83.8%.
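The evaluation code is not reproduced here, but the interval described above can be illustrated with a minimal sketch: five repeated runs per model, a mean accuracy and a 99% normal-approximation interval derived from the SD of those runs. Whether the original analysis used the SD directly or the standard error is not stated; this sketch assumes the standard error, and the run values below are placeholders rather than results from the study.

```python
# Minimal sketch (not the authors' code) of a 99% CI over repeated runs.
import statistics

def accuracy_ci(run_accuracies, z=2.576):
    """Mean accuracy and a 99% normal-approximation CI across repeated runs."""
    mean = statistics.mean(run_accuracies)
    sd = statistics.stdev(run_accuracies)            # sample SD over the runs
    half_width = z * sd / len(run_accuracies) ** 0.5  # SD scaled to a standard error
    return mean, (mean - half_width, mean + half_width)

# Example: five hypothetical runs of one model over the 399 questions.
runs = [0.538, 0.541, 0.536, 0.540, 0.539]
mean, (lo, hi) = accuracy_ci(runs)
print(f"accuracy = {mean:.1%} (99% CI {lo:.1%} to {hi:.1%})")
```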
Discussion
This study examined the performance of the top 31 LLMs in responding to Portuguese questions within a medical context. The results indicated that ten models exceeded the exam’s highest cut-off and the human average score of 67%, including GPT-4o, which achieved a score of 86.8%. Additionally, 12 models scored below the human average, and 9 models were unable to generate coherent answers for the proposed tasks. It is important to note that these 31 models represent the highest-performing models in test taking.

Notably, all responses from proprietary models consisted of single letters, allowing for straightforward text comparison to evaluate the models’ outputs. Similarly, the larger open-source models predominantly provided single-letter responses. Conversely, smaller models often generated more complex text responses, suggesting a failure to accurately interpret the command prompts. Ultimately, 227 out of 8778 questions were excluded from the results due to a lack of agreement between ChatGPT-4 and Claude Opus in the correction process.

Among the open-source models, Llama 3 70B achieved the highest performance. Along with Qwen1.5 72B and Mixtral 8×22B, these open-source models outperformed the human test takers. As expected, the proprietary larger models, GPT-4o and Claude Opus, exhibited the best performance, and Gemini Pro 1.0 and the other Claude models also outperformed the human average. The companies behind these models do not disclose the number of training parameters, making it challenging to analyse the performance of each LLM relative to its size; however, it can be estimated that both GPT-4o and Claude Opus are larger models compared with the others. Although the smaller models could not compete with the larger ones, Llama 3 8B and Claude Haiku demonstrated impressive performance relative to their training sizes. Notably, Claude Haiku, with approximately 20 billion parameters, surpassed the human average.
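As a hedged illustration of the correction workflow described above, and not the authors’ actual pipeline, the sketch below treats a clean single-letter reply as directly comparable to the answer key and falls back to two corrector models for free-text replies, excluding the item when they disagree. The `graders` callables are hypothetical stand-ins for calls to ChatGPT-4 and Claude Opus.

```python
# Hedged sketch of a single-letter / corrector-consensus grading step.
import re

def extract_letter(response: str):
    """Return the option letter A-E if the reply is a bare single-letter answer."""
    match = re.fullmatch(r"\(?([A-Ea-e])\)?[.)]?", response.strip())
    return match.group(1).upper() if match else None

def score_answer(response: str, correct_letter: str, graders) -> str:
    """Score one reply: exact letter match when possible, grader consensus otherwise."""
    letter = extract_letter(response)
    if letter is not None:                      # single-letter reply: direct comparison
        return "correct" if letter == correct_letter else "incorrect"
    # Free-text reply: ask both hypothetical corrector models whether it matches the key.
    verdicts = {grade(response, correct_letter) for grade in graders}
    if len(verdicts) == 1:                      # correctors agree
        return verdicts.pop()
    return "excluded"                           # disagreement: drop this item
```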
Conclusion
Both proprietary and open-source LLMs have achieved satisfactory performance on a standardised national test evaluating medical knowledge among physicians in Brazil, often surpassing the human test takers. In general, although larger LLMs tended to perform better, some medium-sized LLMs (Llama 3 70B and 70B instruct, Claude Haiku and Claude Sonnet) were competitive, outperforming some of the larger LLMs.
The Portuguese benchmark tool is now implemented and available for use by the scientific community. For future investigations, it would be important to compare the performance of the same LLMs on the Revalida benchmark and on English-written benchmarks. This comparison would enable a thorough analysis to determine whether there is a bias in the advancement of LLMs outside the English and Chinese contexts. Finally, it would be valuable to investigate the impact of methods such as retrieval-augmented generation (RAG) on the LLMs used in this study when applied to the Revalida benchmark.