GPT-4 is getting dumber before our eyes: researchers record a clear degradation of the language model
Will the era of neural networks end before it even begins?
Last Tuesday, researchers from Stanford University and the University of California, Berkeley published a joint paper reporting that GPT-4's responses change over time. The work feeds the common but as yet unproven belief that the popular language model has become dramatically worse at many tasks over the past few months.
In the study, titled “How Is ChatGPT's Behavior Changing over Time?” and published on arXiv, Lingjiao Chen, Matei Zaharia and James Zou cast doubt on the consistently high performance of OpenAI's large language models, specifically GPT-3.5 and GPT-4.
Using API access, the researchers tested the March and June 2023 versions of these models on tasks such as solving math problems, answering sensitive questions, generating code, and visual reasoning. In particular, according to the researchers, GPT-4's accuracy at identifying prime numbers plummeted from 97.6% in March to just 2.4% in June. Strangely, GPT-3.5 showed the opposite trend on most tasks, improving over the same period.
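To make the kind of comparison described above concrete, here is a minimal sketch of scoring two model snapshots on a prime-number task. It is not the researchers' code: the functions `snapshot_a` and `snapshot_b` are hypothetical stand-ins for calls to two model versions, and the number range is chosen arbitrarily for illustration.

```python
from typing import Callable, Iterable


def is_prime(n: int) -> bool:
    """Ground truth: trial division, fine for small test numbers."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True


def accuracy(answer_fn: Callable[[int], bool], numbers: Iterable[int]) -> float:
    """Fraction of numbers for which answer_fn agrees with the ground truth."""
    numbers = list(numbers)
    if not numbers:
        raise ValueError("need at least one test number")
    correct = sum(answer_fn(n) == is_prime(n) for n in numbers)
    return correct / len(numbers)


# Hypothetical stand-ins for two model snapshots (assumptions, not real APIs).
def snapshot_a(n: int) -> bool:
    return is_prime(n)  # pretends to answer correctly every time


def snapshot_b(n: int) -> bool:
    return False  # pretends to always answer "not prime"


if __name__ == "__main__":
    test_numbers = range(2, 50)
    print(f"snapshot A accuracy: {accuracy(snapshot_a, test_numbers):.1%}")
    print(f"snapshot B accuracy: {accuracy(snapshot_b, test_numbers):.1%}")
```

In a real test, each stand-in would send the same prompt to a specific model version and parse the yes/no answer. The composition of the test set matters here: a snapshot that always gives the same answer can score very high or very low depending on how many primes the set contains.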
Comparison of accuracy of GPT-3.5 and GPT-4 responses in March and June
The researchers were prompted to run the study after users began complaining en masse that GPT-4's performance seemed to be gradually declining. One popular theory is that OpenAI itself is deliberately limiting the model's performance to cut computational costs, speed up responses, and save GPU resources. Another, more playful theory is that GPT-4 was made “stupid” by people who keep asking it stupid questions.
Meanwhile, OpenAI has consistently denied any claims that GPT-4’s capabilities have deteriorated. Just last Thursday, OpenAI VP of Product Peter Welinder wrote on Twitter*: “No, we didn’t make GPT-4 dumber. Quite the contrary: we make each new version smarter than the previous one.”
While the new study may look like strong evidence for the hunches of GPT-4's critics, other experts warn against jumping to conclusions. Arvind Narayanan, a professor of computer science at Princeton University, believes the study's results do not conclusively prove that GPT-4's performance has degraded. In his view, OpenAI has probably just fine-tuned the model, so that it now behaves better on some tasks and worse on others. But that, too, is not certain.
One way or another, the widespread claims of GPT-4 performance degradation have pushed OpenAI to look into the matter itself. “The team is aware of the reported regression and is investigating the matter,” Logan Kilpatrick, OpenAI's head of developer relations, said this Wednesday.
Outside experts might be able to help OpenAI's developers work out why their system has regressed, but the GPT-4 source code is closed to third-party developers, something the company is criticized for at every opportunity.
OpenAI does not disclose the sources of GPT-4's training data, its source code, or even a description of the model’s architecture. With a closed “black box” like GPT-4, researchers are left “wandering in the dark” as they try to determine the properties of a system that may have additional unknown components. In addition, the model may change at any time without notice.
AI researcher Dr. Sasha Luccioni of Hugging Face also considers OpenAI's opacity problematic: “Any results on closed models are not reproducible and unverifiable. Therefore, from a scientific point of view, we are comparing raccoons and squirrels.”
Luccioni also pointed to the lack of standardized benchmarks in this area that would make it easier for researchers to compare different versions of the same language model: “They should actually provide raw results, not just general metrics, so we can see where they are good and how they are wrong.”
Artificial intelligence researcher Simon Willison agreed with Luccioni: “Honestly, the lack of release notes and transparency is perhaps the biggest problem here. How are we supposed to build reliable software on a platform that changes in completely undocumented and mysterious ways every few months?”
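The difference between raw results and a general metric can be shown with a small sketch. It assumes a hypothetical `Record` type holding one prompt, the expected answer, and the model's answer; none of these names come from the study. Two runs can have identical accuracy while failing on different prompts, and only the per-example records reveal that.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Record:
    """One raw result: what was asked, what was expected, what came back."""
    prompt: str
    expected: str
    answer: str

    @property
    def correct(self) -> bool:
        # Case- and whitespace-insensitive match (an illustrative choice).
        return self.answer.strip().lower() == self.expected.strip().lower()


def accuracy(run: List[Record]) -> float:
    """The aggregate metric: share of correct answers in a run."""
    return sum(r.correct for r in run) / len(run)


def regressions(old: List[Record], new: List[Record]) -> List[str]:
    """Prompts the old run answered correctly and the new run got wrong."""
    new_by_prompt = {r.prompt: r for r in new}
    return [
        r.prompt
        for r in old
        if r.correct
        and r.prompt in new_by_prompt
        and not new_by_prompt[r.prompt].correct
    ]


if __name__ == "__main__":
    # Made-up example data, not results from any real model.
    march = [Record("Is 7 prime?", "yes", "Yes"), Record("Is 9 prime?", "no", "yes")]
    june = [Record("Is 7 prime?", "yes", "no"), Record("Is 9 prime?", "no", "No")]
    print(accuracy(march), accuracy(june))  # same aggregate score
    print(regressions(march, june))         # but a different failure
```

Publishing records like these alongside the headline numbers would let other researchers check which individual answers changed between versions, rather than only how the average moved.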
So while the paper may not be perfect, and may even have flaws in how it tests OpenAI's models, it raises important questions about the need for greater transparency and reproducibility when updates to large language models are released. Without that, developers and researchers will continue to face uncertainty and difficulty in working with these rapidly changing AI black boxes.
* The social network is prohibited on the territory of the Russian Federation.