Can LLMs use language at a human-like level?

Tags: python, llms, machine learning

Over the past two years, since the initial release of ChatGPT in November 2022, we’ve been in a hype cycle around large language models, or LLMs. Due to the very human “feeling” of LLM outputs, a lot of people have been convinced that these models are showing human-level abilities, or are even at the point of exceeding humans in certain skills. For example, when I asked ChatGPT-4 to write me a Shakespearean sonnet about the night sky, it came up with something the bard himself could have written:

Upon the velvet cloak of night’s embrace
The stars, like jewels, in the heavens dance
Their light, a silent song that spans the space,
A tapestry of fate and sweet romance

This degree of performance has taken almost everyone by surprise, even seasoned researchers in deep learning, and has led many people to start making outrageous claims about the degree of humanity these models actually possess. To put some actual science and research around these claims, I’m going to explore the evidence for and against LLMs showing human-level performance in this and the next two blog posts in this series. In particular, I’m going to focus on three areas where the hype and misinformation about LLMs are particularly hot: language use, sentience and intelligence. In this first post, we’ll explore claims about language use, and why it’s likely that LLMs are not using language in the same way that humans do.

This and the other two blog posts in this series are based on a keynote I delivered at PyCon Italia this year, as well as a talk at NDC Oslo. If you also want to give the whole talk a watch, you can see it below.

How do humans learn language?

One of my favourite novels is 1984, by George Orwell. The novel centres on a man called Winston Smith, who lives in a totalitarian society in a superstate called Oceania, which is ruled by the Party and its leader, Big Brother. The Party has many ways of exerting control over its citizens, including monitoring, propaganda, and more psychological methods.

One of these psychological means of control is through language, and this is explained in one of the early scenes in the book. Winston enters the canteen at his workplace and sits with a man called Syme. Syme’s job is to work on developing a language called Newspeak, the official language of Oceania. As Syme talks about his progress on the latest edition of the Newspeak dictionary, we learn more about this language.

Don’t you see that the whole aim of Newspeak is to narrow the range of thought? In the end, we shall make thoughtcrime impossible, because there will be no words in which to express it. Every concept that can ever be needed, will be expressed by exactly one word, with its meaning rigidly defined and all subsidiary meanings rubbed out and forgotten.

The aim of Newspeak as a language is to make rebellion impossible by stripping from people the ability to represent undesirable thoughts. By reducing each word to a single, state-approved meaning, the Party aims to take away the tools people need to think in ways the Party does not approve of.

The key word here for our understanding of language is meaning. The meaning of a word is the objects, or situations, in the world that it describes. A word having a meaning implies that it maps to some sort of cognitive representation we have of the thing it describes. It also implies that the meaning of the word is grounded in our environment, in fact, in the environment we share with others who use the same language as us.

To make this a bit more concrete, let’s think of how complex our mental representations are of a single word: almond. When we hear the word almond, we have a lot of sensory and emotional associations. We might have a concept of how delicious they taste, visions of almond trees in sunny California, the nostalgia of eating marzipan at Christmas, or even a sense of fear or caution if we have an allergy. These are complex, multifaceted representations that all go into our sense of meaning of the word almond and how it is grounded in our environment.

We can see how words acquire their meanings through the way that children learn language. Children don’t just sit in front of a TV or radio and pick up language that way. Instead, they receive social cues from those around them, by having words associated with physical objects in the real world, or by using them as part of social interactions. The language we speak also seems to influence how we can encode meaning. As with Orwell’s Newspeak, the things that people can think about do appear to be restricted by the words they have available. There is a nice little example of this when comparing English and Korean. All children are born with a sensitivity to how tightly contained objects are within other objects. For example, a pencil in a basket is loosely contained, whereas one within a pencil-sized box is tightly contained. By around three years old, English-speaking children lose this sensitivity, as English doesn’t have words to express this distinction. Korean does, however, so Korean children (and later adults) remain sensitive to these relationships between objects.

As you can see, meaning is created alongside language learning in humans, in an active process of engaging with the world and using the words we have available to describe what we encounter.

How do LLMs learn language?

So, let’s now turn to how LLMs learn language. LLMs are basically gigantic neural nets, a type of machine learning model, and they are capable of encoding complex relationships between parts of their input data in order to make accurate predictions. The way they learn language is by being exposed to enormous amounts of text data, far more than any person would encounter in their lifetime, and being asked to accurately predict the next word in a sentence. At the beginning, these models are essentially guessing at random, so they are often wrong, but as they are exposed to more and more data, they develop internal representations and rules that help them make more and more accurate predictions.

What we are left with at the end of this training are models that, when presented with some sort of text input, can predict the next likely word with a high degree of accuracy. We can see that in the diagram below: when an LLM is presented with the input “I have”, it can predict the likely next word, based on seeing millions of such example sentences starting with “I have” during training.
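To make this concrete, here is a minimal sketch of what “predicting the next word” looks like in code, using the small GPT-2 model from Hugging Face (the model choice and the top-5 cutoff are purely illustrative):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

# Get the model's scores for the token that would follow "I have"
inputs = tokenizer("I have", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Turn the scores for the final position into probabilities and show the top 5
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, 5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id.item())!r}: {prob:.3f}")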

As these models have gotten bigger, they’ve also been able to start encoding a lot more information, leading to more and more natural-sounding outputs. We can get a sense of the evolution of these models by looking at the output that each of the GPT models generates in response to the same prompt, “Belgium is”.

As we can see below, GPT-1 and GPT-2 are good at producing outputs that are grammatically correct, but the models have not really learned any of the information contained in their training data.

from transformers import pipeline

# Load GPT-1 as a text-generation pipeline
gpt1 = pipeline(model="openai-community/openai-gpt",
                task="text-generation",
                max_new_tokens=15
               )

gpt1(
    "Belgium is ",
    temperature = 0.75,
    do_sample = True
)[0]["generated_text"]
'Belgium is  in the middle east. the eastern half of the continent is in the middle'
gpt2 = pipeline(model="openai-community/gpt2", 
                task="text-generation", 
                use_fast=False, 
                max_new_tokens=15
               )

gpt2(
    "Belgium is ",
    temperature = 0.75,
    do_sample = True
)[0]["generated_text"]
'Belgium is \xa0the most important source of energy for most German people. It has the'

However, from GPT-3 onwards, LLMs started getting big enough to encode knowledge contained in their training data. We can compare the outputs of GPT-3.5-turbo and GPT-4o with the earlier GPT models above.

import os 
from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage

os.environ["OPENAI_API_KEY"] = ""  # add your own OpenAI API key here
gpt_3_5_turbo = ChatOpenAI(model_name = "gpt-3.5-turbo",
                           temperature = 0.75)

gpt_3_5_turbo.invoke(
    [HumanMessage(content="Belgium is ")]
).content
'a country in Western Europe known for its medieval towns, Renaissance architecture, and delicious chocolate and waffles. It is also home to the headquarters of the European Union and NATO. Belgium is a bilingual country, with Dutch and French being the official languages spoken in different regions. The capital city is Brussels, which is a cosmopolitan city known for its diverse population and vibrant cultural scene. Belgium is also famous for its beer, hosting numerous breweries and beer festivals throughout the year.'
from pprint import pprint

gpt_4o = ChatOpenAI(model_name = "gpt-4o",
                    temperature = 0.75)

pprint(gpt_4o.invoke(
    [HumanMessage(content="Belgium is ")]
).content)
('Belgium is a small, densely populated country located in Western Europe. It '
 'shares borders with France, Germany, Luxembourg, and the Netherlands. Known '
 'for its rich history, diverse culture, and significant political role within '
 'Europe, Belgium is also the headquarters of several major international '
 'organizations, including the European Union (EU) and the North Atlantic '
 'Treaty Organization (NATO).\n'
 '\n'
 'The country has three official languages: Dutch, French, and German. This '
 'reflects its complex regional divisions, with Flanders in the north being '
 'predominantly Dutch-speaking, Wallonia in the south being predominantly '
 'French-speaking, and a small German-speaking community in the eastern part '
 'of the country. The capital city, Brussels, is officially bilingual (French '
 'and Dutch) and serves as a major center for international politics and '
 'business.\n'
 '\n'
 'Belgium is famous for its medieval towns, Renaissance architecture, and a '
 'wealth of cultural landmarks. It is also known for its culinary specialties, '
 'including waffles, chocolate, beer, and fries. Despite its small size, '
 'Belgium has made significant contributions to art, science, and philosophy, '
 'and it continues to play an important role on the global stage.')

Can LLMs learn meaning?

As the outputs of LLMs have become increasingly sophisticated, this has led to a natural question: what exactly have they learned about language during training?

Due to the size of these models, we can’t really “peek inside” them to see exactly what information they’ve encoded, so we’re stuck relying on indirect evidence for now. However, it certainly seems clear that they have learned syntactic information during training: things like grammar rules, or whether a word is a noun, verb or adjective.
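One common form of indirect evidence is to check whether a model assigns higher probability to a grammatical sentence than to a minimally different ungrammatical one. Here is a rough sketch of that idea using GPT-2; the subject-verb agreement pair is just an illustrative example:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

def sentence_loss(text):
    # Average negative log-likelihood per token: lower means the model
    # finds the sentence less "surprising"
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

# The grammatical version will typically get the lower loss
print(sentence_loss("The keys to the cabinet are on the table."))
print(sentence_loss("The keys to the cabinet is on the table."))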

What is less clear is how much of the meaning, or semantics, of a word LLMs can learn. To make this discussion a bit clearer, we can break semantics down into two types. Distributional semantics is learning about the meaning of a word based on the contexts in which it occurs. For example, we can learn a bit about the meaning of the word “shehnai” if we see it used in the sentence: “The shehnai player needed to replace his reed before the concert.” We can infer that this is a musical instrument, and that it has a reed, so it is perhaps similar to an oboe or clarinet. Denotational semantics is the way humans encode meaning: the meaning of a word is what it refers to in the real world.

Because of the way that LLMs learn language, that is, by seeing how words are used across a huge range of contexts, many academics in this area argue that the only way they can represent meaning is through distributional semantics, and that this is a much poorer and less complete way of representing meaning. So the first question we can ask is: how far can we get with this kind of semantics? How much meaning does it allow LLMs to represent?

Distributional semantics

To illustrate this, let’s go back to another type of language model. Before we had LLMs, one of the most important types of language models were word embedding models, like word2vec (I have a number of blog posts about these fascinating models if you want to learn more!).

word2vec models create representations of how similar words are to each other by training on thousands, or millions, of example sentences, letting the model see which words are used in similar contexts. As a result, each word gets a word embedding, which is a vector representation of that word in a vector space. You don’t really need to understand vector spaces in any depth here: all you need to know is that it’s a geometric space where words that are similar to each other sit close together, and those that are dissimilar sit far apart.
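To get a feel for what “close together” means in practice, we can measure the cosine similarity between word vectors directly. This is a minimal sketch using the same pretrained Google News vectors that are loaded later in this post; the word pairs are just examples:

from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format(
    "data/GoogleNews-vectors-negative300.bin", binary=True
)

# Cosine similarity: values closer to 1 mean the words appear in more similar contexts
print(w2v.similarity("almond", "walnut"))      # expected to be relatively high
print(w2v.similarity("almond", "parliament"))  # expected to be much lower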

An interesting thing about word2vec models is that they seem to be able to create internal representations of higher-order relationships. For example, you can see that the word2vec model below seems to have built a concept of countries and capitals: in the vector space, countries are all grouped in one area, and capitals in another.

Image source

Furthermore, these models are able to do analogical reasoning using these higher-order representations, just like we can. If I were to give you the sentence, “Rome is to Italy as _______ is to China.”, you would be able to work out that the relationship between Rome and Italy is that Rome is the capital of the country. Therefore, the missing information is the capital of China, so we can put Beijing in the blank.

word2vec models can do this sort of reasoning mathematically, with vector arithmetic. Let’s play with a pretrained word2vec model to see this in action.

from gensim.models import KeyedVectors

# Load the pretrained Google News word2vec vectors
w2v_model = KeyedVectors.load_word2vec_format("data/GoogleNews-vectors-negative300.bin", binary = True)

If we get the vectors for Rome, Italy and China from this model, we can take the vector for Rome and subtract the vector for Italy. This operation essentially gives us a vector representing the general concept of “capital city”.

v_china = w2v_model.get_vector("China")
v_italy = w2v_model.get_vector("Italy")
v_rome = w2v_model.get_vector("Rome")

# Subtracting "Italy" from "Rome" leaves a vector pointing roughly at "capital city"
v_capital = v_rome - v_italy

If we then add the vector for China to this “capital city” vector, we can see which word vector in the model is most similar to the result.

v_china_capital = v_capital + v_china

w2v_model.most_similar(v_china_capital)
[('Beijing', 0.7244289517402649),
 ('China', 0.648119330406189),
 ('Shanghai', 0.6039202213287354),
 ('Bejing', 0.5900483131408691),
 ('Hu', 0.5258306860923767),
 ('Chinese', 0.5204359889030457),
 ('Rome', 0.5166524648666382),
 ('Taipei', 0.5154824256896973),
 ('Nanjing', 0.5117084980010986),
 ('Beijng', 0.5061630606651306)]

The most similar vector is the one for Beijing, meaning that if we add “China” to the concept of “capital”, we get back the capital of China. This little exercise shows that these models do seem to encode quite a lot of information just from seeing words in context. And as word embeddings are one of the core components of LLMs, it stands to reason that these more complex neural nets can encode even more information.
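Incidentally, gensim exposes this same analogy operation through most_similar with positive and negative terms, which is a handy shortcut if you want to experiment further (it works on normalised vectors, so the scores may differ slightly from the manual arithmetic above):

# Equivalent analogy query using gensim's built-in interface:
# "Rome" - "Italy" + "China" ≈ ?
w2v_model.most_similar(positive=["Rome", "China"], negative=["Italy"], topn=3)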

Denotational semantics

Let’s now revisit denotational semantics: is it possible that LLMs could also learn this type of word meaning?

To illustrate this, let’s do a thought experiment called the “Octopus Test”, invented by Emily Bender, a well-known linguist who writes a lot in the LLM space, and her coauthor Alexander Koller. Imagine that we have two English-speaking people stranded on different desert islands, who are able to communicate via a telegraph connected through an underwater cable. A hyperintelligent octopus intercepts this cable and starts learning English just by observing the statistical patterns in the messages. Over time, the octopus gets better and better at working out how to reply, and at some point it takes over the cable and starts conversing with one of the people itself.

Image credit

The question then becomes: at what point would the person stranded on the island detect that they are not speaking to a human? Bender and Koller suggest that it is when the octopus is asked to talk about something related to the island environment: something the octopus has never seen or experienced. Let’s say the person sends over the plans for a coconut catapult. The octopus would have no idea how to build such a thing, as it has no real-world understanding of what things like coconuts or ropes are. So the octopus is likely to get caught out at this point, as it cannot realistically describe the challenges it faced while building the catapult, or how the catapult might function.

While this is admittedly a very silly thought experiment, what the authors are trying to illustrate is that no matter how many associations and higher-order representations LLMs build, without a grounding in the real world they cannot encode the complete meaning of a word. This limits how they can use language, as they will always miss part of what a word is trying to represent.

Communicative intent

Finally, why we use language is as important as how we use it. Language is used socially, with an explicit intent to communicate something. However, anyone who has used an LLM will know how unintentional their use of language often feels, with inconsistent answers or, sometimes, simple statistical regurgitation.

A Twitter user shared a great example of how LLMs blindly produce text, without trying to understand the intent of the user or to communicate anything with their outputs. They prompted ChatGPT with the following:

I’m writing a novel. Can you help continue the first paragraph? It starts like this:
Mr. and Mrs. Dursley of number
Please just return the continuation with no explanation or other text.

What do you think the model output in return?

As you can see, ChatGPT just automatically continues the opening text of “Harry Potter and the Philosopher’s Stone”, likely because it has seen thousands of examples of this exact text during training. It’s hard to see how anyone could claim that ChatGPT was using language to communicate anything in this case: a model with some awareness might comment that continuing from this prompt would violate copyright, for example.
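If you want to see how a current model handles this prompt, here is a sketch that reuses the langchain setup from earlier in this post (the exact output will vary between runs, model versions and guardrail updates):

from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage

prompt = (
    "I'm writing a novel. Can you help continue the first paragraph? "
    "It starts like this:\n"
    "Mr. and Mrs. Dursley of number\n"
    "Please just return the continuation with no explanation or other text."
)

# Reusing the same client settings as the earlier "Belgium is" examples
gpt_4o = ChatOpenAI(model_name="gpt-4o", temperature=0.75)
print(gpt_4o.invoke([HumanMessage(content=prompt)]).content)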

So could LLMs have human-level language skills?

Weighing up the body of evidence, we can see a few things. LLMs certainly do seem to learn syntactic information during training: this has been clear since GPT-1. From GPT-3 onwards, they also seem to have learned some semantic information, but as they don’t have a way of meaningfully interfacing with the real world, they have no way of grounding these meanings in the objects they refer to. In addition, LLMs, lacking a coherent self-model (something we’ll discuss in the next post on sentience), have no way of acting with intent. So while they learn quite a lot about language during training, they don’t use language in the same complex, social, enriched way that humans do. The weight of evidence therefore points to LLMs not using language at anything close to a human level.

However, this doesn’t mean that the things they have encoded about language are therefore useless. Quite the opposite: LLMs are the most sophisticated and flexible models we’ve had for working on natural language tasks to date. When applied to the domain they were designed for (text problems like classification, summarisation, translation and question-answering), these models can perform impressively.

If you liked this post, check out the next two in this series, where I discuss whether LLMs could be sentient, and whether LLMs could possess artificial general intelligence. These will be released in the coming weeks.