How do LLMs actually work?

Conversational AI models have captured our imagination not only by resembling real conversation but by actually being useful. And there is more than one model, too. It seems like if you throw a hypothetical rock into the internet, you will very likely hit a conversational AI model, be it ChatGPT, Claude, DeepSeek, or Llama, just to name a few.

But what are they, actually? And how do they work?

Well, ChatGPT is powered by an underlying Large Language Model (LLM): a type of machine learning model, built on a deep neural network, designed for natural language processing.

But really, an LLM is nothing more than a complicated mathematical function that predicts which word comes next for a particular piece of text. It does not predict that word with certainty, though; instead, the model assigns a probability to every possible next word and then chooses one.
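To make that concrete, here is a minimal sketch in Python. The vocabulary and the probabilities are made up for illustration; a real model's vocabulary has tens of thousands of entries.

```python
import numpy as np

# A toy illustration (not a real model): given the text so far, an LLM
# outputs a probability for every word in its vocabulary.
vocab = ["mat", "dog", "moon", "car"]
prompt = "The cat sat on the"

# Hypothetical probabilities a trained model might assign to each next word.
probs = np.array([0.85, 0.08, 0.05, 0.02])

# Pick the next word according to those probabilities.
next_word = np.random.choice(vocab, p=probs)
print(prompt, next_word)
```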

So, if you want to build a chatbot like ChatGPT, you essentially present the model with text that describes some interaction between a user and a hypothetical AI assistant, like a script in a movie where two characters are talking.

You add whatever the user types as the first part of the interaction, and then you have the model predict, word by word, what this hypothetical AI assistant would say next. That response is what is shown to the user.
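As a sketch of that loop, the following assumes a hypothetical `generate_next_word` function standing in for the real model, and a made-up `<end>` marker for when the assistant is done:

```python
# A sketch of the "movie script" framing. `generate_next_word` stands in
# for the real model and is an assumption, not a real API.
def build_prompt(user_message: str) -> str:
    return f"User: {user_message}\nAssistant:"

def chat(user_message: str, generate_next_word, max_words: int = 50) -> str:
    text = build_prompt(user_message)
    reply = []
    for _ in range(max_words):
        word = generate_next_word(text)   # model predicts one word at a time
        if word == "<end>":               # assumed stop marker
            break
        reply.append(word)
        text += " " + word                # feed the growing text back in
    return " ".join(reply)
```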

The results usually sound more natural if you allow the model to pick less likely words from time to time, at random. This is why, even though the model itself is deterministic, the same prompt can produce a different answer each time you run it.
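One common way to control this randomness is a "temperature" parameter applied to the model's raw scores before sampling. A minimal sketch, with made-up scores:

```python
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float = 1.0) -> int:
    """Convert raw model scores (logits) into probabilities and sample.

    temperature < 1 sharpens the distribution (more predictable),
    temperature > 1 flattens it (more surprising word choices).
    """
    scaled = logits / temperature
    scaled -= scaled.max()                   # for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return np.random.choice(len(probs), p=probs)

logits = np.array([3.0, 1.5, 0.5, 0.1])     # hypothetical scores for 4 words
print(sample_with_temperature(logits, temperature=0.7))
```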

These predictions are learned by processing huge amounts of text from the open internet and many other sources. For the model to learn, of course, you need to train it.

We can think of training an LLM as turning the dials on a big machine until we find just the right settings. The dials in the model are called parameters, or weights. Changing these values changes the probabilities the model assigns to the next word for a given input.

LLMs can have hundreds of billions of these parameters. When training starts, the parameters are initialized at random and then refined by looking at many examples of text.
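To picture what "randomly initialized dials" means, here is a toy next-word predictor whose parameters are just random arrays. At this point its probabilities are meaningless guesses:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 4, 8

# The "dials": parameters start out random, so the model's predictions
# start out as noise. Training is what turns them into something useful.
embedding = rng.normal(size=(vocab_size, embed_dim))     # one vector per word
output_weights = rng.normal(size=(embed_dim, vocab_size))

def next_word_probs(word_id: int) -> np.ndarray:
    scores = embedding[word_id] @ output_weights
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()                               # softmax into probabilities

print(next_word_probs(0))   # arbitrary values before any training
```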

Training essentially works by showing the model a piece of text with the last word missing and having it predict what comes next. We then compare the model's prediction with the word we expected. With that comparison in hand, we can use the backpropagation algorithm to tweak the parameters, making the model more likely to choose the correct last word and less likely to choose the others. After a lot of training, the model gets better at predicting not only the text it was trained on but also text it has never seen before.
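Here is a minimal sketch of one such training step using PyTorch's autograd. The model is a deliberately tiny stand-in that predicts the next word from a single word of context, and the word ids are made up:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 100, 32

# A tiny next-word predictor: look up the current word's vector, then
# score every word in the vocabulary as a possible successor.
model = nn.Sequential(nn.Embedding(vocab_size, embed_dim),
                      nn.Linear(embed_dim, vocab_size))
loss_fn = nn.CrossEntropyLoss()       # compares prediction vs. expected word
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

context = torch.tensor([5])           # id of the visible word
target = torch.tensor([42])           # id of the word we hid from the model

logits = model(context)               # scores for every possible next word
loss = loss_fn(logits, target)        # how wrong was the prediction?
loss.backward()                       # backpropagation: compute gradients
optimizer.step()                      # nudge the dials the right way
optimizer.zero_grad()
```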

Now, given the enormous number of parameters and the huge amount of training data, this requires a staggering amount of computation. And this is only the first part of the training, known as pre-training.

For a model to be a good AI assistant, it also needs another phase of training called reinforcement learning from human feedback (RLHF), where humans flag unhelpful predictions. This feedback further adjusts the parameters, making the model more likely to give responses users prefer.
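One small, simplified piece of that pipeline is training a reward model from a human preference between two candidate replies. The scores below are placeholder scalars rather than outputs of a real network; the loss shown is a common preference-ranking formulation, not the whole RLHF procedure:

```python
import torch
import torch.nn.functional as F

# In a real setup these scores would come from a network rating two
# candidate replies; here they are just placeholder values.
score_chosen = torch.tensor(0.3, requires_grad=True)    # reply the human preferred
score_rejected = torch.tensor(0.9, requires_grad=True)  # reply the human rejected

# The loss is small when the preferred reply scores higher.
loss = -F.logsigmoid(score_chosen - score_rejected)
loss.backward()   # gradients push the scores toward the human's ranking
```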

This much computation is only feasible with special computer chips, GPUs, that can run these operations in parallel. However, not every model architecture can be easily parallelized.

In 2017, a team of Google researchers published a paper called “Attention Is All You Need,” introducing a new type of model built for natural language: the Transformer.

This type of model does not read the text from start to finish. Instead, it takes in all of the text at once, in parallel. To do this, it associates each word with a long list of numbers, because the training process only works with continuous values, and this list of numbers should encode the meaning of the word.
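As a sketch of that idea, each word below maps to a vector of random numbers. These are untrained, so the similarities printed are meaningless; after training, words with related meanings end up with nearby vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
embed_dim = 6   # real models use hundreds or thousands of dimensions

# Each word gets a long list of numbers (a vector).
embeddings = {w: rng.normal(size=embed_dim) for w in ["king", "queen", "car"]}

def similarity(a: str, b: str) -> float:
    """Cosine similarity between two word vectors."""
    va, vb = embeddings[a], embeddings[b]
    return va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))

print(similarity("king", "queen"))   # arbitrary here: vectors are untrained
```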

Transformers are special because they use an operation called attention, which lets these lists of numbers talk to one another and refine the meanings of the words based on the surrounding context, all in parallel.
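Here is a stripped-down sketch of that attention operation. Real transformers first project the word vectors into separate query, key, and value matrices; this version skips that and lets the vectors attend to one another directly:

```python
import numpy as np

def attention(X: np.ndarray) -> np.ndarray:
    """Simplified self-attention: every word vector looks at every other.

    X has one row per word. The query/key/value projections of a real
    transformer are omitted for brevity.
    """
    scores = X @ X.T / np.sqrt(X.shape[1])         # relevance of word j to word i
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over each row
    return weights @ X                             # blend context into each word

X = np.random.default_rng(2).normal(size=(5, 8))   # 5 words, 8 numbers each
print(attention(X).shape)                          # still (5, 8), now context-aware
```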

Transformers also include another key operation: a feed-forward neural network, which gives the model extra capacity to store patterns about language learned during training.
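A sketch of that feed-forward step, which, unlike attention, processes each word's vector independently. The sizes are made up, though the 4× expansion mirrors common transformer configurations:

```python
import numpy as np

def feed_forward(X, W1, b1, W2, b2):
    """Per-word feed-forward step: expand, apply a nonlinearity, project back."""
    hidden = np.maximum(0, X @ W1 + b1)   # ReLU nonlinearity
    return hidden @ W2 + b2

rng = np.random.default_rng(3)
d, d_ff = 8, 32                           # hidden size is typically ~4x the vector size
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)
X = rng.normal(size=(5, d))               # 5 word vectors
print(feed_forward(X, W1, b1, W2, b2).shape)   # still (5, 8)
```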

All of this data goes through many iterations of these two operations, attention and a feed-forward network. The idea is that, layer by layer, each list of numbers is enriched until it encodes everything needed to accurately predict the next word in the text.
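Putting the two together, here is a sketch of those stacked layers, reusing `attention`, `feed_forward`, and the arrays from the two sketches above. The `X + ...` additions are residual connections, a detail real transformers include that the earlier sketches left out:

```python
# Reuses `attention`, `feed_forward`, X, W1, b1, W2, b2 from the sketches above.
num_layers = 4          # real models stack dozens of these layers
for _ in range(num_layers):
    X = X + attention(X)                      # words exchange context
    X = X + feed_forward(X, W1, b1, W2, b2)   # each word refined on its own
# After the final layer, the enriched vectors are used to score the next word.
```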

Source

3Blue1Brown. (2024). Large Language Models explained briefly [Video]. YouTube. https://www.youtube.com/watch?v=LPZh9BOjkQs

P.S.: if you want a better explanation than this article, I really recommend the video above, and the channel as a whole!