Introduction: Recently, many groups have studied the performance of general large language models (LLMs) such as ChatGPT (OpenAI) on various professional examinations and certifications. However, performance on the Royal College of Physicians and Surgeons of Canada written examination in Neurosurgery has not previously been reported.
Methods: The Canadian examination comprises 36 question stems with multiple short-answer questions, administered over 6 hours; the passing grade is 70%. Most questions are text-based, while some require radiological or histological interpretation. A set of 18 practice question stems was prepared by the national Ottawa Review Course committee and administered to 41 final-year (PGY-6) residents and fellows less than 2 months before their examination. The same questions were then administered to GPT-3.5 and GPT-4. Text-based questions were entered without modification. However, because the public versions of GPT were not capable of image interpretation, image-based questions were entered in one of two ways: either unmodified (without prompting) or modified with a written description of the image (with prompting). Responses from both human and AI participants were graded by a faculty member using a pre-established rubric.
Results: The mean grade of the PGY-6 residents and fellows was 59% (162/264), with a range of 28–74%. Without prompting, GPT-3.5 scored 48% (132/274) while GPT-4 scored 68% (p < 0.00001). Without prompting, GPT-4 outperformed the average resident (68% vs 59%, p < 0.0001). With written prompting for image-based questions, GPT-3.5 scored 63% (160/254) and GPT-4 scored 76% (194/254) (p < 0.01). With prompting, GPT-3.5 scored similarly to the residents (63% vs 59%, p = 0.36) but did not pass, while GPT-4 outperformed the average resident (76% vs 59%, p < 0.0001) and achieved a passing grade.
Conclusion: On a practice exam designed to approximate the Canadian Royal College written examination in Neurosurgery, GPT-4 achieved a passing grade and performed significantly better than both GPT-3.5 and a cohort of PGY-6 residents and fellows. Further in-depth analysis is needed to examine differences in performance between factual-recall and clinical-judgment questions.