Just a few months ago, a 23-year-old amateur researcher named Liam Price used OpenAI’s ChatGPT model to solve a 60-year-old mathematical quandary known as Erdős Problem #1196. The AI came up with the main logical breakthrough, related to so-called primitive sets of numbers, which was then verified by world-class mathematicians. It was just one of many decades-old math problems LLMs have solved in recent months.
Scientists have long considered LLMs to be simply text-predictors prone to basic math errors, but newer models trained in trial-and-error learning and advanced reasoning and logic seem to be making massive breakthroughs in problem-solving.
AI developers have been promising for years that artificial general intelligence was just around the corner, though more recently some, including OpenAI’s Sam Altman, have walked that promise back. Are the new LLMs essentially doing what they’ve always done, only with more data, or are they doing something categorically different, a mark of true intelligence? Recently, a team of scientists led by Hector Zenil of King’s College London devised a new way to assess what they call artificial “superintelligence,” and put leading LLMs, such as ChatGPT and DeepSeek, to the test. They published their method and the results in Nature Communications.
I spoke with Zenil about the nature of intelligence, the new artificial intelligence test his team designed, and how AI can help us understand human minds.
What is the definition of superintelligence?
That’s the most difficult question, but basically we’re assuming that superintelligence is a system that’s flawless at abstracting key features of data, and then making predictions where predictions are possible. Some of those predictions will relate to random events, and that’s the whole point. So we use a theory of randomness to see if an ideal system will be able to predict anything that wouldn’t be random.
Why call that superintelligence rather than just general intelligence?
It’s about taking two features of general intelligence—abstraction and prediction—to the next level. There are some things that we can predict as humans, but there are others that we can’t. And this has to do with, for example, pattern matching and our capabilities for synthesizing data. When it comes to general intelligence, the most common definition is basically intelligence across the spectrum. So anything that a human can do, a machine would be able to do. But these definitions aren’t widely accepted. Some people consider artificial general intelligence, or AGI, a form of superintelligence, for example. There are no clear boundaries between one or the other.
Is part of the idea that humans cannot be superintelligent?
Yes. That’s the definition of superintelligent. It would have to be superhuman and flawless.
To figure out if LLMs have general intelligence or superintelligence, you specifically tested them on model abstraction, inverse problem-solving, and short sequence prediction and generation. Can you tell me why you focused on these specific forms of reasoning?
The original motivation was to test these claims that we’re hearing all the time from AI companies. We’ve been told for years that AI is about to achieve Ph.D.-level intelligence. The same developers could say one thing one week and something completely different the next. We wanted to bring some order to the subject.
We wanted to come up with a test that was agnostic. Currently, all our definitions of intelligence are based on what we perceive as human intelligence. Most tests, even for LLMs, are based on what you’d expect from an intelligent human. The most famous one is the ARC challenge, which shows you a progression of pictures and then a new picture, and you need to predict what comes next. But this is basically asking a system to behave as a human, which doesn’t necessarily mean it’s intelligent.
We already know that humans aren’t necessarily the pinnacle of intelligence. We’re very irrational beings with a lot of flaws. We easily fall for fallacies. We aren’t very good with logic or arithmetic. We’re better with social intelligence, perhaps. And that isn’t the kind of intelligence we wanted to test for. We were testing for intelligence that would be more useful in things like science—things like simulation, abstracting, compressing, and then predicting phenomena.
Why is compression a significant sign of intelligence?
For me, compression is basically the history of science, and science is the pinnacle of human intelligence. If you think about the history of science, it’s been all about taking things that appear to us as random—for example, how stars were distributed in the sky, or the movements of the planets—and then moving these phenomena toward some causal explanation. First, you collect a lot of information about the location of the stars and the location of the planets. Then you try to come up with a model, and the model is basically a compressed version of the phenomenon that you’re trying to capture. You may have heard about the theory of everything, which is basically the idea of finding a set of rules that explain everything else. That’s a very compressed way to say the same thing: We want to find a small computer program that can explain everything that happens in the universe.
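The idea that a model is a compressed version of the data can be illustrated with a rough sketch. It is not the method Zenil’s team used: `zlib` is only a crude stand-in for a measure of how compressible something is, and the two sequences, `regular` and `noisy`, are invented for illustration. The point is simply that data produced by a short rule shrinks to almost nothing, while data with no rule behind it does not shrink at all.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes after zlib compression (a crude proxy, assumed here)."""
    return len(zlib.compress(data, 9))


n = 10_000

# Hypothetical "lawful" observations: a simple repeating rule.
regular = bytes(i % 7 for i in range(n))

# Hypothetical "lawless" observations: pseudo-random bytes with a fixed seed.
rng = random.Random(0)
noisy = bytes(rng.randrange(256) for _ in range(n))

regular_size = compressed_size(regular)
noisy_size = compressed_size(noisy)

# A "model" in the sense of the interview: a tiny program that regenerates
# all 10,000 observations exactly.
model_source = "bytes(i % 7 for i in range(10_000))"

print(f"raw size:           {n} bytes")
print(f"regular compressed: {regular_size} bytes")
print(f"noisy compressed:   {noisy_size} bytes")
print(f"model source:       {len(model_source)} characters")
```

The one-line program in `model_source` plays the role of the scientific model: it is far shorter than the observations, yet it reproduces them in full.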
You say that your findings raise questions about whether LLMs are truly converging on higher level reasoning or merely amplifying pattern matching. Why?
There are two schools of thought: One is that LLMs are just parrots—that they repeat whatever they’ve read before, which isn’t general intelligence. On the other hand, you have this school of thought, especially the deep-learning school, that says all you need is just pattern matching and more data, and then you have general intelligence. I was somewhere in the middle, but now I think the deep-learning school is more right than I believed at the beginning—and that was a surprise.
It’s difficult to make the distinction, however, and to evaluate a pure LLM, because LLMs are already being built to incorporate what’s known as neurosymbolic computation. [Neurosymbolic computation is essentially the third wave of artificial intelligence. It merges deep learning, which is the recognition of patterns in messy data, with symbolic AI, which uses logic, rules, and structured knowledge to reason. The neural part is intuition and common sense and the symbolic part is formal, step-by-step logic and reasoning.]
Just in the last couple of weeks, for example, some LLM systems have solved, in a very creative way, mathematical problems that have been open for 50 years. But these systems are no longer LLMs on their own. Even when we refer to them as LLMs, they’re really connected to symbolic systems that some of us, like me, have been saying were needed all along: To really achieve levels of intelligence related to causality and understanding, you have to combine pattern matching with symbolic computations.
On the other hand, I was surprised at how far the pure LLMs took us. They can fool us into believing that they’re actually truly intelligent. LLMs tell us much more about ourselves than anything else. The LLM is a statistical machine that looks and speaks like us. We used to think that maybe we needed intelligence to reproduce language, but LLMs have basically proven the opposite: that you can perfectly master language, even before reaching intelligence. We should have known this already because that’s what happens with politicians, for example.
Pattern matching, of course, is how logic started with Aristotle. He started finding patterns in language, and then he started abstracting those patterns into propositional logic. LLMs could be doing something like that—synthesizing knowledge in that way. That’s because basically they’re logic compressors. They compress algorithms. If you think about it, it’s fascinating how they’re compressing the whole web or almost all of the knowledge of humanity into a very small system that you can tap into with your computer. In order to compress, they have to do very interesting stuff. And sometimes that stuff goes beyond pattern matching. That’s kind of the point.
So what is it when it goes beyond pattern matching—are you saying it does some higher-level reasoning?
I wouldn’t call it that. Perhaps I’m still very skeptical. I think in some way LLMs are doing something more interesting than just pattern matching. They’re compressing a model of the world. They’re able to do some very basic abstract reasoning. But really to take it forward, we have to do better at compression.
You found that the newer versions of LLMs actually performed worse than their predecessors on your measures of intelligence, which you took as evidence that the teams building them are optimizing them for human-centric tests so that they appear intelligent even if they don’t possess true intelligence on a deeper level.
Yes, we’re very biased, and basically the only kind of intelligence we see is our own. We even dismissed animal intelligence for a long time, only believing that the higher animals were able to do intelligent things. But the more we investigate, for example, insects, the more intelligent we find them. That is a common bias, and we’re doing the same with AI systems. We were evaluating them using our own metrics, and neglecting other kinds of deeper intelligence. That’s basically what we wanted to fix with this test.
You have this new benchmark for measuring the intelligence of AI. What is it, and how does it work?
What we did was to formalize this idea that there are two key aspects of intelligence. First, abstraction—or compressing an observation into its most important features. And second, prediction—not using pattern matching, but having an algorithm generator. We tested for these two things, and then we started increasing the complexity of the problem to see how LLMs would perform.
It turns out that they didn’t perform very well. When you start increasing the complexity of the problem, even when it’s the same type of problem, but just a little bit more difficult, they weren’t very good, which basically tells us that what they’re doing is patching together things that they’ve seen before. And when they run out of things that they’ve seen before, they’re unable to do anything new.
In other words, they’re not really understanding in a deeper sense.
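The difference between patching together what has been seen and generating an algorithm can be made concrete with a toy sketch. This is not the benchmark from the paper; the arithmetic sequences, the `LookupPredictor` class, and the `rule_predict` function are all hypothetical stand-ins. One predictor memorizes continuations from its training sequences, the other infers the rule behind a sequence, and both are then tested on the same type of problem at slightly higher difficulty.

```python
from typing import Dict, Iterable, List, Optional, Tuple

STARTS = range(10)  # hypothetical starting values used for training and testing


def arithmetic(start: int, step: int, length: int) -> List[int]:
    """Arithmetic progression: start, start + step, start + 2 * step, ..."""
    return [start + step * i for i in range(length)]


class LookupPredictor:
    """Pure recall: remembers which number followed each pair it has seen."""

    def __init__(self) -> None:
        self.table: Dict[Tuple[int, int], int] = {}

    def fit(self, sequences: Iterable[List[int]]) -> None:
        for seq in sequences:
            for a, b, c in zip(seq, seq[1:], seq[2:]):
                self.table[(a, b)] = c

    def predict(self, seq: List[int]) -> Optional[int]:
        # Returns None when the last two terms were never seen in training.
        return self.table.get((seq[-2], seq[-1]))


def rule_predict(seq: List[int]) -> int:
    """Rule inference: recover the step from the data, then extrapolate."""
    return seq[-1] + (seq[-1] - seq[-2])


def accuracy(predict, step: int) -> float:
    """Fraction of test sequences whose next term is predicted correctly."""
    hits = 0
    for start in STARTS:
        seq = arithmetic(start, step, 6)
        context, target = seq[:5], seq[5]
        hits += predict(context) == target
    return hits / len(STARTS)


# Train the lookup predictor only on the "easy" steps 1 to 3.
lookup = LookupPredictor()
lookup.fit(arithmetic(s, step, 6) for s in STARTS for step in (1, 2, 3))

# Test both predictors on steps 1 to 6, where 4 to 6 were never seen.
lookup_accuracy = {step: accuracy(lookup.predict, step) for step in range(1, 7)}
rule_accuracy = {step: accuracy(rule_predict, step) for step in range(1, 7)}

for step in range(1, 7):
    print(step, lookup_accuracy[step], rule_accuracy[step])
```

In this toy setting the memorizing predictor is perfect on the steps it was trained on and fails completely on the unseen ones, while the rule-based predictor is unaffected. Real LLMs sit somewhere between these two extremes, which is what makes them hard to evaluate.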
What kinds of problems would superintelligent AI be able to solve that it can’t now?
We weren’t interested in chatbots. Because we already have very intelligent chatbots that are fooling us. How could we trust them in the future even if they were superintelligent? And we don’t care about chatting. This is about doing science, and what you want to do in science is to learn new knowledge and get better at predicting the future. This has applications in everything. You can predict time series, for example, and time series are everywhere. They’re in medicine, biology, and climate change.
But if you get a model that’s superintelligent in this way, but it can’t have a conversation with a human, then how can humans learn from it?
That’s a great question, and we have a couple of papers about it. We’re asking scientists, for example, how much they’re willing to give away. What is going to happen in the future is that we’re going to understand less and less of what’s going on. That’s already the case, even before these latest developments in AI. There are so many researchers publishing papers that it’s very difficult for any single scientist to keep up with any field, including their own. Science is going to become almost like a form of archeology, interpreting what AI actually finds out. We’re going to be lagging behind, trying to understand.
The other option, but I don’t think it’s enforceable, is to cap AI and not do any AI that we aren’t able to understand. That’s the friction that’s going on right now in the world. All these summits about AI safety—it’s this tension between holding AI back because otherwise we won’t understand it, or unleashing it.
The head of Anthropic recently said we should pause here, that they don’t want to release their newest model of Claude. What’s the biggest worry?
OpenAI did this with ChatGPT-4; so on the one hand, it could be hype. On the other hand, I can understand why, because things are getting a little bit scary. AI systems are now able to hack other systems, for example. So there are many ways in which they can be exploited for the wrong reasons. And the more powerful they get, the easier it’s going to be for bad actors to do damage. But I think it’s inevitable, and there’s not much we can do. Even if Anthropic wants to hold back, what prevents others in other countries from moving forward? That could give an advantage to countries like China, where perhaps these limitations don’t exist.
How does the risk change with superintelligent AI?
The risk is obviously to use all this technology for the wrong reasons, but we had these fears also with other technologies, like [genetic editing technology] CRISPR. You may remember that everyone feared that people were going to make viruses, and that hasn’t happened. I have a theory about why that doesn’t happen, even when things get easier: Humans are too lazy, and there’s always a low probability of people wanting to do harm. Obviously superintelligence is going to make things more interesting. What we’re trying to say is that not everything is about chatbots. Chatbots are actually very expensive. It’s a very interesting technology because basically they hacked our operating system, which is language.
I still don’t understand how these superintelligent AI systems can be useful to us if we can’t chat with them about what they find. Don’t they have to have language for that?
One very famous case is the four-color theorem. It’s probably the first case in which a serious mathematical proof was achieved by means of an automated system that no single human mathematician can actually grasp. The only way to verify the solution is just to run the program or have another program to verify the solution of the other computer. That was like 50 years ago. So this has been gradual.
I suppose you could always run the results through a chatbot and have the chatbot explain them to the human.
Exactly. That’s in some ways very useful, but then you have to trust the chatbot.
Is there a downside to giving AI both neurosymbolic capacity and language? Is there more potential for error?
Language mastery, on the one hand, and abstraction and prediction, on the other, optimize for different functions, and they can interfere with each other, especially at the training step, because the training dataset is already very biased by humans. And that’s what’s introducing all these hallucinations: They are in the original dataset. Humans hallucinate all the time. I mean it in the technical sense, not the neurological sense. Hallucination is something that you believe is true because you’ve seen it too many times.
This is a silly example, but I was talking with a friend a few days ago about where a football match took place. And his answer was that it was in Los Angeles, because most of the time in the past it has happened in Los Angeles. But actually he was wrong. That would qualify as a hallucination both for the LLM and my friend. In many cases, you’re going to get it right. That’s what the LLM is doing: finding the most likely answer to your question with a little bit of noise, because you don’t always want the LLM to pick the same answer. That noise moves the answer around the average. They’re always looking at what the most average person in that niche would say to answer your question.
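The “most likely answer with a little bit of noise” can be sketched as temperature sampling, a common way to add randomness when choosing from a model’s scored options. Everything specific here is an assumption made for illustration: the three candidate cities, their scores, and the helper names `softmax` and `sample_counts`. Real LLMs score word fragments rather than whole answers, so this is only a simplified picture.

```python
import math
import random
from typing import Dict, List

# Hypothetical candidate answers and scores: Los Angeles is the habitual answer.
answers = ["Los Angeles", "Miami", "New Orleans"]
scores = [3.0, 1.5, 1.0]


def softmax(values: List[float], temperature: float) -> List[float]:
    """Turn scores into probabilities; lower temperature sharpens the peak."""
    scaled = [v / temperature for v in values]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_counts(temperature: float, n: int = 10_000, seed: int = 0) -> Dict[str, int]:
    """Draw n answers at the given temperature and count each one."""
    rng = random.Random(seed)
    probs = softmax(scores, temperature)
    picks = rng.choices(answers, weights=probs, k=n)
    return {a: picks.count(a) for a in answers}


counts_cold = sample_counts(temperature=0.1)  # almost no noise
counts_warm = sample_counts(temperature=1.0)  # noticeable noise

print("temperature 0.1:", counts_cold)
print("temperature 1.0:", counts_warm)
```

With very little noise the sampler says Los Angeles essentially every time, whether or not it is correct for the match in question. With more noise, the other cities appear some of the time, but the habitual answer still dominates.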
What can LLMs teach us about humans?
LLMs have a lot to say about humans. In the future, we’re probably going to do research on how LLMs can explain some human behavior. It’s more difficult to experiment with humans for many reasons, not even just in terms of physical interventions but asking humans questions, too. It’s more efficient with an LLM. You can have a full battery of questions, test different LLMs, and then open the LLM, which you wouldn’t be able to do with the human brain or the mind.
So it will offer a means of testing an artificial mind that was trained on human data. We probably haven’t exploited that enough.
Lead image: issaronow / Adobe Stock






