Are LLMs on the path to AGI?


Claims of artificial general intelligence, or AGI, have fuelled some of the hottest and most emotionally charged discussions about LLMs. They also probably carry the most intellectual weight of all of the claims about LLM abilities. For example, in March 2023, Microsoft Research released a paper titled “Sparks of Artificial General Intelligence”, claiming that a series of experiments on GPT-4 showed the model displaying at least some signs of AGI. Later that year, researchers Blaise Agüera y Arcas (an AI research fellow at Google) and Peter Norvig (a Distinguished Education Fellow at the Stanford Institute for Human-Centered AI) published a piece stating that LLMs had already crossed the threshold for AGI.

So with these academic heavy hitters, plus many others, throwing their weight behind the idea that LLMs are showing AGI, how can we be sure that these models haven’t developed intelligence yet?

This and the other two blog posts in this series are based on a keynote I delivered at PyCon Italia this year, as well as a talk at NDC Oslo, where I debunk some of the more outrageous claims about LLMs demonstrating human-level traits and behaviours. If you want to give the whole talk a watch, you can see it below. You can also read the previous posts in this series: the first, where I discuss why LLMs are not using language in the same way we do as humans, and the second, where I explain why LLMs are likely not sentient.

The work in this blog post is based on the following three excellent articles: “On the Measure of Intelligence” by François Chollet, “GPT-4 and professional benchmarks: the wrong answer to the wrong question” by Sayash Kapoor and Arvind Narayanan on their excellent blog “AI Snake Oil” (full of many such good reads), and “Levels of AGI for Operationalizing Progress on the Path to AGI” by a team at DeepMind.

Haven’t we heard this story before?

In May of 1997, the then world chess champion Garry Kasparov was facing off against Deep Blue, IBM’s chess-playing AI. This was their second match, Kasparov having won the first in February 1996. In the second game of this match, Deep Blue made an unexpected move, which rattled Kasparov and made him believe the machine was far more powerful than it actually was. Thrown off his game, he lost that second game to Deep Blue, then drew the following games until finally losing the sixth, and with it the whole match. The press went wild, speculating that if we could conquer chess with an artificial system, then surely AGI was just around the corner. However, it’s been almost 30 years, and you can see how those predictions panned out.

In fact, ever since we started developing advanced machines, predictions that we’re on the cusp of creating an artificial intelligence have been made regularly. As far back as 1863, right at the start of the second industrial revolution, the author Samuel Butler argued that the development and ascendancy of an artificial general intelligence was inevitable, writing:

The upshot is simply a question of time, but that the time will come when the machines will hold the real supremacy over the world and its inhabitants is what no person of a truly philosophic mind can for a moment question.

And mathematician Claude Shannon, interviewed only a few years after the development of the first neural net, predicted that AGI was only a few years away:

I confidently expect that within a matter of ten or fifteen years, something will emerge from the laboratories which is not too far from the robot of science fiction fame.

So as you can see, the current speculation about whether we’ve created AGI, or are on the path to it, is part of a long history. However, just because we’ve had false starts before doesn’t mean we can simply hand-wave away the possibility that LLMs are an AGI. Instead, let’s have a closer look at the evidence.

Skill-based assessments of intelligence are misleading

The problem with trying to assess intelligence through whether a system does well at some complex task is that it confuses the output of an AI system with the mechanism the model used to get there. We can use skill-based assessments of intelligence in humans because we know that, in humans, the ability to demonstrate significant skill in an area reflects an underlying, general raw ability.

AI models - or, let’s call them what they are, machine learning models - don’t work this way. They will try to optimise for their training goals, and if they can take shortcuts to get good performance on those goals, they will. The mistake is assuming that the only pathway to learning some skill is through the development of underlying intelligence.

This illusion of skill-based intelligence can be seen on websites like Kaggle. On this site, people compete to create a machine learning algorithm that can get the best performance on some specific task. The winning solutions are so good at doing this specific task that they seem to show “intelligent” behaviour. However, as soon as they try to predict on a data point too far outside of what they were trained on, they fail: they show brittleness.
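To make this concrete, here’s a minimal sketch of that brittleness using scikit-learn and a toy regression task of my own choosing (not a real Kaggle competition): a model fitted on a narrow slice of inputs does well near its training data and falls apart far away from it.

```python
# A minimal sketch of local generalisation and brittleness (my own toy task,
# not a real Kaggle competition), using scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Train on a narrow slice of the input space: x in [0, 10]
X_train = rng.uniform(0, 10, size=(2000, 1))
y_train = np.sin(X_train).ravel()

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# In-distribution test points, close to the training data
X_near = rng.uniform(0, 10, size=(500, 1))
# Out-of-distribution test points, far from anything seen in training
X_far = rng.uniform(40, 50, size=(500, 1))

mse_near = np.mean((model.predict(X_near) - np.sin(X_near).ravel()) ** 2)
mse_far = np.mean((model.predict(X_far) - np.sin(X_far).ravel()) ** 2)

print(f"MSE near the training data: {mse_near:.4f}")     # small: the "skill" works
print(f"MSE far from the training data: {mse_far:.4f}")  # large: the "skill" breaks down
```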

This is a problem plaguing current assessments of intelligence in LLMs. They focus strongly on how well these models perform on specific tasks, ignoring how intelligence is defined and measured in humans. And this is largely because, while the researchers making these claims are very established in their fields, those fields are things like engineering, mathematics and computer science - not psychology.

Let’s have a look at an example that shows why these skill-based assessments are a problem. One of the biggest hype topics at the beginning of last year was people finding that ChatGPT and GPT-4 could pass medical and law exams, with the resultant claims that they were going to replace lawyers and doctors. There were also examples of these models solving LeetCode puzzles, with similar claims that programmers were going to be replaced. Let’s have a closer look at this last claim.

There was a great example circulating on Twitter around the time GPT-4 came out. Horace He tested GPT-4 on some coding challenges from a website called Codeforces. What makes Codeforces really useful for this little experiment is that the date each problem was released is recorded. He collected 10 puzzles that were available when GPT-4 was trained, ran them through the model, and wouldn’t you know it, it got all of them right.

He then collected another 10 puzzles of equivalent difficulty that were released after GPT-4 was trained - and this time GPT-4 got every single one of them wrong. What happened here?

Essentially, all of the puzzles that GPT-4 solved were contained in its training set, and the model had simply memorised them. Sayash Kapoor tested this explicitly by prompting the model to tell him about Codeforces problems that were available at the time it was trained, and GPT-4 obligingly vomited up the complete source, confirming that Codeforces puzzles were in its training data. Testing a model on the data it was trained on is considered a fundamental error when measuring model performance: of course your model will do well on a problem if it was trained on that exact problem!
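The fix is basic evaluation hygiene, which is exactly what the Codeforces experiment exploits: only score the model on problems released after its training cutoff. Here’s a small illustrative sketch; the cutoff date and record fields are made up, not taken from the original experiment.

```python
# A sketch of the evaluation hygiene this example highlights: only score the model
# on problems released after its training cutoff. The cutoff date and record fields
# here are illustrative, not taken from the original experiment.
from datetime import date

TRAINING_CUTOFF = date(2021, 9, 1)  # hypothetical training cutoff for the model under test

problems = [
    {"id": "problem_a", "released": date(2021, 3, 14)},  # may well be in the training data
    {"id": "problem_b", "released": date(2023, 5, 2)},   # cannot have been memorised
]

# Problems released before the cutoff can only tell us about memorisation,
# not about the model's ability to solve genuinely new problems.
evaluation_set = [p for p in problems if p["released"] > TRAINING_CUTOFF]
print([p["id"] for p in evaluation_set])  # ['problem_b']
```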

I think this is a very neat example of how skill-based assessments of intelligence in LLMs can be wildly misleading. Instead, we need to look at how well these systems can solve tasks they’ve never seen: that is, how well they can generalise.

Generalisation: a potential path to AGI

This idea of the generalisability of AI systems is explored in depth by François Chollet in “On the Measure of Intelligence”, where he defines a hierarchy of generalisation with five levels, going from least to most general.

No generalisation: These are systems which do not generalise beyond the use case they were explicitly programmed to do, and includes systems which “know” all of the possible iterations or outcomes in advance. An example would be a program that uses all configurations of a tic-tac-toe game in order to play.
Local generalisation: This is where a system can make inferences on examples it hasn’t seen before, but only if they are similar to the examples it was trained on. Machine learning models demonstrate local generalisation: when they try to make a prediction for an example that is too different to what they were trained on, they will fail, as we discussed earlier with Kaggle entries.
Broad generalisation: This describes human-level ability in a single broad activity domain. A famous example of this is the Wozniak coffee test, proposed by Steve Wozniak of Apple fame. He proposed that a system that could go into a kitchen and make a cup of coffee without assistance would be demonstrating broad generalisation. Fully automated self-driving vehicles would also fall under this category.
Extreme generalisation: This level is essentially human-level intelligence. This describes the ability of an artificial system to handle entirely new tasks that might only share abstract commonalities with previously encountered ones, and also apply these abilities across a wide scope of domains that humans might be reasonably expected to encounter.

So you might be thinking: you’re focusing a lot on human-level skill. But we’re talking about an artificial system: don’t we want it to go beyond what we can do, to be better than us? And that leads to the final level of generalisation:
Universality: This is the ability to handle any task within our universe, beyond the scope of tasks relevant to humans.

However, universality should be dismissed, at least as an initial goal for artificial systems, for a couple of reasons. Firstly, all systems need a scope to be useful, and the most immediate use case we have for AI models is automating tasks normally done by people. Secondly, as you can see, we’re not even at the point of broad generalisation, so we should focus on knocking off the easier generalisation goals before we aim so high.

If we come back to intelligence, we can line this idea of generalisation up with the definition of human intelligence. The most widely accepted conceptualisation of human intelligence (the one that I learned during my PhD) is that humans have a general ability to learn, called general intelligence, or g. We use our general intelligence to learn broad abilities, such as how to cook or drive a car, and within those broad abilities are the specific tasks that we can complete, such as whisking eggs or using an indicator. We can see that an artificial system’s extreme generalisation aligns with g in humans, broad generalisation aligns with broad abilities, and no generalisation or local generalisation aligns with task-specific skills.

Given these similarities, it seems likely that we can take lessons from measuring intelligence in humans, which is a well-established field, and apply them to measuring intelligence in artificial systems. As an aside, I’m aware that intelligence is a controversial area in psychology - however, the idea is that we can borrow ideas, like the ones above, which can help us create a definition that gets at the true generalisability of artificial systems.

What might a system with AGI look like?

In his paper, Chollet also gives us a blueprint for how we might build a system with artificial general intelligence.

He defines it as follows: an artificial system should demonstrate the ability to complete a task, using knowledge encoded in a skill program relevant to the task, which is generated by a human-like intelligent system. I hope you can see how these align with the general intelligence, broad abilities and task-specific abilities in humans that we just saw. These skill programs are refined based on feedback about both the situation and the effectiveness of the response, and the intelligent system learns over time through exposure to more and more tasks.

Chollet also argues that if we’re focusing on the development of a human-like AGI, then we should presume that such a system is designed to include the same innate knowledge that humans are born with: elementary geometry and physics, arithmetic, and an awareness of the agency of other entities. Skill programs will, over time, be able to encode the system’s experience with tasks and remember how it solved such problems before, and tasks can vary in their generalisation difficulty. The generalisation difficulty of a task is how different it is from previously encountered tasks - if a task varies greatly from things the system has seen before, then the system must generalise further to complete it successfully. So obviously, tasks high in generalisation difficulty are going to be the true test of such a system.
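To make the moving parts a little more tangible, here’s a loose sketch of this blueprint as Python. The class and method names are my own shorthand rather than anything defined in Chollet’s paper, and the genuinely hard part - generating a skill program for a never-before-seen task - is deliberately left unimplemented.

```python
# A loose sketch of the blueprint: an intelligent system that, given a task,
# generates a skill program and refines it from feedback. Names are my own shorthand.
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Task:
    description: str
    generalisation_difficulty: float  # how far this task sits from previously seen tasks

@dataclass
class SkillProgram:
    solve: Callable[[Any], Any]  # the task-specific skill, encoded as a program

@dataclass
class IntelligentSystem:
    priors: dict = field(default_factory=dict)      # innate knowledge: geometry, physics, arithmetic, agency
    experience: list = field(default_factory=list)  # previously encountered tasks and their outcomes

    def generate_skill_program(self, task: Task) -> SkillProgram:
        # The unsolved part: synthesise a program for a new task using only
        # the system's priors and its accumulated experience.
        raise NotImplementedError

    def refine(self, program: SkillProgram, feedback: float) -> SkillProgram:
        # Update the skill program based on how effective its responses were.
        raise NotImplementedError
```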

So what we have here is a starting point, a conceptualisation of how we might build a system with artificial general intelligence.

Measuring AGI

So how might we assess how well a system can complete each task? Chollet defines it as follows:

$$ \frac{1}{\text{All tasks in scope}} \sum_{\text{tasks}} \left( \text{Value of achieving skill in task} \cdot \frac{1}{\text{All ways of solving task}} \sum_{\text{ways}} \left( \frac{\text{Generalisation difficulty}}{\text{Priors} + \text{Experience}} \right) \right) $$

Don’t worry if this looks a bit hairy: we’re going to break it down. Starting with the innermost term on the right-hand side, we have:

$$ \frac{\text{Generalisation difficulty}}{\text{Priors} + \text{Experience}} $$

What this describes is how difficult it is for this system to solve one specific task using one specific approach: the generalisation difficulty of the task, divided by the sum of how much the task aligns with the system’s innate priors and how much experience the system has with solving the task in this particular way. Solutions that are further away from what the system has seen or done before will be more “difficult”.

There are obviously different ways of solving the task, so we can average the difficulty over all possible ways of solving it, to give the average difficulty of the task. This is why we sum the inner term over every way of solving the task and scale the result by:

$$ \frac{1}{\text{All ways of solving task}} $$

Finally, we multiply by a subjective component (\(\text{Value of achieving skill in task}\)), which aims to capture how valuable or impressive it is that the system can do this task. It’s obviously much more impressive for a system to perform brain surgery than to output a list of pseudorandom numbers.

We then average this across a range of tasks within the scope of what we would want an intelligent system to be able to do. Overall, this gives us a measure of the system’s skill-acquisition efficiency over a broad range of tasks, which we can think of as its generalisability, or its artificial general intelligence. The real trick in building a working assessment from this is defining a couple of quite tricky-to-define things, such as which tasks need to be included in the scope to sufficiently measure generalisation, and how to assess the generalisation difficulty of tasks relative to what the system already knows.
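As a toy illustration only, here’s that calculation written out as a small Python function. The task values, difficulties, priors and experience numbers are invented; Chollet’s actual formalism defines these quantities in algorithmic information terms, not as hand-picked numbers like these.

```python
# A toy, numerical reading of the measure above (all numbers invented for illustration).
def solution_difficulty(generalisation_difficulty: float, priors: float, experience: float) -> float:
    return generalisation_difficulty / (priors + experience)

def intelligence_score(tasks: list[dict]) -> float:
    total = 0.0
    for task in tasks:
        # Average the difficulty over all the ways the task could be solved
        avg_difficulty = sum(
            solution_difficulty(gd, p, e) for gd, p, e in task["solutions"]
        ) / len(task["solutions"])
        total += task["value"] * avg_difficulty
    # Average over all tasks in scope
    return total / len(tasks)

tasks = [
    {   # a familiar task: low generalisation difficulty, plenty of experience
        "value": 1.0,
        "solutions": [(0.2, 1.0, 5.0), (0.3, 1.0, 4.0)],  # (difficulty, priors, experience)
    },
    {   # a novel task: high generalisation difficulty, little prior knowledge or experience
        "value": 2.0,
        "solutions": [(3.0, 0.5, 0.5)],
    },
]

print(f"Toy intelligence score: {intelligence_score(tasks):.3f}")
```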

Chollet went one step further and proposed a potential measurement of AGI based on his definition, called the Abstraction and Reasoning Corpus (ARC). This consists of a series of 100 test puzzles, for each of which the test-taker is given a handful of worked examples and asked to work out the rule those examples demonstrate in order to solve a final example. This is thought to require the test-taker to form internal abstractions to understand the rule being set out in each problem - that is, to create a skill program to solve it. Here’s an example from the test: in all three worked examples, the idea is to change the colour of the squares enclosed by green squares from black to yellow.
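To give a feel for the format, here’s a made-up, ARC-style task sketched as numpy grids. The colour coding and the specific rule are my own invention for illustration; the real ARC tasks use their own palette and are considerably more varied.

```python
# A made-up, ARC-style task as numpy grids of colour codes (here 0 = black,
# 3 = green, 4 = yellow). Palette and rule are invented for illustration.
import numpy as np

example_input = np.array([
    [0, 0, 0, 0, 0],
    [0, 3, 3, 3, 0],
    [0, 3, 0, 3, 0],
    [0, 3, 3, 3, 0],
    [0, 0, 0, 0, 0],
])

# The rule in this toy task: any black cell enclosed by green cells turns yellow
example_output = example_input.copy()
example_output[2, 2] = 4

# The test-taker sees a handful of input/output pairs like this, has to infer the
# underlying rule, and must then apply it to a new, unseen input grid.
```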

ARC is considered more robust than other ways of measuring intelligence in artificial systems for a couple of reasons. Firstly, each puzzle comes with only a handful of examples, so a model can’t brute-force it by training on them. And secondly, the test-set solutions have not been made publicly available, so they cannot be memorised by a model during training. While it’s not a perfect tool, as Chollet himself will readily agree, it definitely goes beyond subjective assessments or tests designed for humans.

So, what does performance on ARC look like to date? Every year, a competition called the ARC Prize is held, where the developers of an algorithm that can solve all 100 problems will receive prizes from a pool of more than $1,000,000. The 2023 winners each only reached 30%, showing we still have quite a long way to go to reach AGI by this benchmark.

How far are we away from AGI?

Chollet’s work has been followed up by researchers from DeepMind. They again start from the idea that an intelligent artificial system must be able to generalise. However, they’ve simplified this dimension by breaking it into “narrow” and “general” systems.

To this they add another dimension, performance, and propose measuring it in an artificial system as the percentage of people the system can outperform. This ranges from outperforming no humans at all, to outperforming people unskilled in the task, and then up through 50%, 90% and 99% of people, and finally, all people.

Let’s first have a look at systems with narrow generalisation. Among the systems that outperform no one, we have calculators and similar systems that need to be operated entirely by humans. GOFAI, or good old-fashioned AI - the old-school type of AI based on hard-coded rules - is classified as emerging narrow AI. At the higher end, we have Grammarly, which can outperform 90% of all people at grammar and spelling checking, our old friend Deep Blue, which can beat 99% of the population at chess, and AlphaFold, which can predict a protein’s 3D structure better than 100% of humans, even skilled scientists.
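As a rough summary, here’s the performance dimension sketched as a small Python mapping, with the examples above slotted into the levels as I read the DeepMind paper; the level names come from that paper, but the threshold wording is paraphrased, not quoted.

```python
# A rough sketch of the DeepMind performance levels, with the blog's examples
# slotted in; threshold wording is paraphrased from my reading of the paper.
PERFORMANCE_LEVELS = {
    "Level 0: No AI":      "outperforms no one (e.g. a calculator operated by a human)",
    "Level 1: Emerging":   "comparable to or somewhat better than an unskilled human (e.g. GOFAI rule-based systems)",
    "Level 2: Competent":  "outperforms at least 50% of people",
    "Level 3: Expert":     "outperforms at least 90% of people (e.g. Grammarly, for grammar and spelling)",
    "Level 4: Virtuoso":   "outperforms at least 99% of people (e.g. Deep Blue, at chess)",
    "Level 5: Superhuman": "outperforms 100% of humans (e.g. AlphaFold, for protein structure prediction)",
}

for level, description in PERFORMANCE_LEVELS.items():
    print(f"{level}: {description}")
```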

But these are narrow systems. As we know, the true path to AGI lies in systems that can generalise. On the general side, the DeepMind team classify ChatGPT as emerging AGI, stating that it can outperform unskilled people across the full range of tasks you’d expect humans to be able to do. I disagree with this, as Chollet’s definition of the scope of generality is that it encompasses a wide range of tasks that humans are able to do, and even GPT-4o is quite limited in what it can do right now. However, perhaps they are talking about a future, ultra-augmented version of ChatGPT, like the sort of system Andrej Karpathy discusses at the end of his excellent introduction to LLMs.

One final observation is that the rest of the general side of the scale is not filled in yet, showing how much further these researchers think we have to go on the path to AGI. Coupled with Chollet’s definition and his work on ARC, we can see how many steps away we are from creating an artificial system with AGI.

Like Chollet, the DeepMind researchers also discuss how we might assess our progress towards AGI. We already have a number of benchmarks, the most cutting-edge being BIG-bench, which tries to assess new capabilities that LLMs may be developing, with tasks ranging from predicting chess moves to guessing emojis.

However, let’s look beyond LLMs, which many argue are not a path to AGI at all, and think about assessments for AI systems more broadly. The DeepMind researchers argue that, in order to reflect the way humans solve problems, more unstructured, multi-step problems may be needed, like Wozniak’s coffee test that we discussed earlier.

Finally, we need to think about how we will interact with such a system. Unless we have a system that is totally autonomous, which may not be particularly desirable, we will need some way for humans to interact with it. This means that the system should perhaps have some of the traits humans require for successful social interactions, such as empathy and theory of mind, and such “social intelligence” traits should also be part of any benchmarking of AGI. The DeepMind team spend quite a lot of time defining these levels of autonomy in their paper, and it was a section I particularly enjoyed and found very thought-provoking.

With this, we’re at the end of our series, and I hope I have been able to convince you that claims of human-like abilities and behaviours in LLMs have been wildly overstated. LLMs, while impressive, are still just machine learning models with a range of strengths and weaknesses, and their illusions of humanity are just that - illusions. If we can demystify LLMs, we can instead focus on what they can do well, while also not being tempted to extend them past their capabilities. Using LLMs as intended - as powerful, but limited, natural language models - will allow us to get the most out of them right now without exposing ourselves to unnecessary risk.

If you liked this post, and you haven’t read the previous two in this series, you might be interested in my discussion of whether LLMs have human-level language use or my exploration of whether LLMs could be sentient.