I randomly stumbled upon this article, "Large language models surpass human experts in predicting neuroscience results". Should we worry about this?
- Tests were drawn from BrainBench, a benchmark that checks how well neuroscience results can be predicted from study abstracts.
Co-authors (Supplementary Table 5) and GPT-4 (Azure OpenAI API; version 2023-05-15) created test cases that formed BrainBench. All test cases were sourced from Journal of Neuroscience abstracts published in 2023 under the Creative Commons Attribution 4.0 International License (CC-BY). The abstracts are organized into five sections, namely, behavioural/cognitive, systems/circuits, neurobiology of disease, development/plasticity/repair and cellular/molecular.
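For intuition, here is a minimal sketch (not the authors' pipeline) of how an altered abstract could be produced with GPT-4 through the Azure OpenAI API version cited above; the endpoint, key, deployment name and prompt wording are all placeholders/assumptions.

```python
# Minimal sketch of generating an altered abstract via Azure OpenAI (API version 2023-05-15,
# as cited in the paper). Endpoint, key, deployment name and prompt are assumptions.
from openai import AzureOpenAI

client = AzureOpenAI(
    api_version="2023-05-15",
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",  # placeholder
    api_key="YOUR-KEY",                                       # placeholder
)

def make_altered_abstract(abstract: str) -> str:
    """Ask GPT-4 to change the reported result while keeping the methods intact."""
    response = client.chat.completions.create(
        model="gpt-4",  # assumed deployment name
        messages=[
            {"role": "system",
             "content": "Rewrite the results of this neuroscience abstract so the "
                        "outcome changes substantially, leaving the methods unchanged."},
            {"role": "user", "content": abstract},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content
```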
- LLMs outperformed human experts: LLM ($>$ 0.8 accuracy) vs. predoctoral student ($\approx$ 0.65), doctoral student ($\approx$ 0.6), postdoctoral researcher ($\approx$ 0.65), faculty ($\approx$ 0.65).
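As I understand it, the LLM accuracy is computed by a perplexity comparison: the model is credited with a correct answer when the original abstract is less surprising to it than the altered one. A minimal sketch of that scoring, assuming a stand-in Hugging Face model (not one of the benchmarked LLMs) and a list of (original, altered) pairs:

```python
# Hedged sketch of perplexity-based scoring: the LLM "picks" whichever abstract version
# it finds less surprising. gpt2 is a stand-in, not one of the models benchmarked.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss          # mean per-token cross-entropy
    return torch.exp(loss).item()

def prefers_original(original: str, altered: str) -> bool:
    return perplexity(original) < perplexity(altered)

def accuracy(items) -> float:
    """items: iterable of (original_abstract, altered_abstract) pairs."""
    items = list(items)
    return sum(prefers_original(o, a) for o, a in items) / len(items)
```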
- Shouldn't we (humans) be more critical when reading papers?
- Should we update neuroscience textbooks?
- The differences among the human groups are not so significant. But what happened to the doctoral students ($\approx$ 0.6, the lowest)?
- Human expertise does not necessarily align with LLM expertise.
Perplexity measures how surprising a text passage is to an LLM. Using these measures (Supplementary Fig. 6), the mean Spearman correlation between an LLM and human experts was 0.15 ($\pm$0.03), whereas the mean Spearman correlation between LLMs was 0.75 ($\pm$0.08).
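To make that concrete, here is a hedged sketch of the kind of item-level rank correlation behind those numbers, using scipy.stats.spearmanr; the per-item signals ("margin" meaning perplexity of the altered minus the original abstract, and per-item human accuracy) are my assumptions, and the arrays are synthetic placeholders, not the paper's data.

```python
# Illustrative item-level correlation analysis; synthetic placeholder data only.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_items = 200
llm_a_margin = rng.normal(size=n_items)                                   # model A's per-item margin
llm_b_margin = 0.9 * llm_a_margin + rng.normal(scale=0.3, size=n_items)   # a similar model B
human_accuracy = rng.uniform(size=n_items)                                # fraction of experts correct per item

rho_llms, _ = spearmanr(llm_a_margin, llm_b_margin)          # LLM-LLM agreement (high in the paper, ~0.75)
rho_llm_human, _ = spearmanr(llm_a_margin, human_accuracy)   # LLM-human agreement (low, ~0.15)
print(f"LLM-LLM rho: {rho_llms:.2f}, LLM-human rho: {rho_llm_human:.2f}")
```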
- No indication of brute-force memorization...
We found no indication that BrainBench was memorized by LLMs (Supplementary Fig. 7). ... As a final check (Methods and Supplementary Fig. 8), we confirmed that LLMs do not perform better on items published earlier in 2023 (for example, January 2023 versus October 2023), which addresses the concern that early items are more likely to have a preprint or other precursor appear in the training set that affected BrainBench performance.
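A toy version of that last control, under assumed column names and made-up values, would simply compare accuracy across publication months:

```python
# Toy sketch of the publication-date check: bucket items by month and see whether
# correctness trends with how early in 2023 the source abstract appeared.
# Column names and values are illustrative assumptions, not the authors' data.
import pandas as pd
from scipy.stats import pointbiserialr

items = pd.DataFrame({
    "month":   [1, 1, 3, 5, 7, 9, 10, 10],   # publication month in 2023
    "correct": [1, 0, 1, 1, 0, 1, 1, 0],     # whether the LLM chose the original abstract
})

print(items.groupby("month")["correct"].mean())              # accuracy per month
r, p = pointbiserialr(items["correct"], items["month"])      # month-vs-correctness association
print(f"point-biserial r = {r:.2f}, p = {p:.2f}")
```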
- Hugging Face repo: https://huggingface.co/BrainGPT