A Deep Dive on Large Language Models—And What They Mean For You
The release of OpenAI’s ChatGPT in late 2022 has utterly transformed the conversation around artificial intelligence. Whether it’s generating functioning web apps with just a few prompts, writing Spanish-language children’s stories about the blockchain in the style of Dr. Suess, or opining on the virtues and vices of major political figures, its ability to generate long strings of coherent, grammatically-correct text is shocking.
Seen in this light, it’s perhaps no surprise that ChatGPT has achieved such a staggering rate of growth. The application garnered a million users less than a week after its launch.
It’s believed that by January of 2023, this figure had climbed to 100 million monthly users, blowing past the adoption rates of TikTok (which needed nine months to get to this many monthly users) and Instagram (which took over two years.)
Naturally, many have become curious about the “large language model” (LLM) technology that makes ChatGPT and similar kinds of disruptive generative AI possible.
In this piece, we’re going to do a deep dive on LLMs, exploring how they’re trained, how they work internally, and how they might be deployed in your business. Our hope is that this will arm Quiq’s customers with the context they need to keep up with the ongoing AI revolution.
What Are Large Language Models?
LLMs are pieces of software with the ability to interact with and generate a wide variety of text. In this discussion, “text” is used very broadly to include not just existing natural language but also computer code.
A good way to begin exploring this subject is to analyze each of the terms in “large language model”, so let’s do that now.
LLMs Are Models.
In machine learning (ML), you can think of a model as being a function that maps inputs to outputs. Early in their education, for example, machine learning engineers usually figure out how to fit a linear regression model that does something like predict the final price of a house based on its square footage.
They’ll feed their model a bunch of data points that look like this:
House 1: 800 square feet, $120,000
House 2: 1000 square feet, $175,000
House 3: 1500 square feet, $225,000
And the model learns the relationship between square footage and price well enough to roughly predict the price of homes that weren’t in its training data.
We’ll have a lot more to say about how LLMs are trained in the next section. For now, just be aware that when you get down to it, LLMs are inconceivably vast functions that take the input you feed them and generate a corresponding output.
LLMs Are Large.
Speaking of vastness, LLMs are truly gigantic. As with terms like “big data”, there isn’t an exact, agreed-upon point at which a basic language model becomes a large language model. Still, they’re plenty big enough to deserve the extra “L” at the beginning of their name.
There are a few ways to measure the size of machine learning models, but one of the most common is by looking at their parameters.
In the linear regression model just discussed, there would be only one parameter, for square footage. We could make our model better by also showing it the home’s zip code and the number of bathrooms it has, and then it would have three parameters.
It’s hard to say how big most real systems are because that information isn’t usually made public, but a linear regression model might have dozens of parameters, and a basic neural network could range from a few hundred thousand to a few tens of millions of parameters.
GPT-3 has 175 billion parameters, and Google’s Minerva model has 540 billion parameters. It isn’t known how many parameters GPT-4 has, but it’s almost certainly more.
(Note: I say “almost” certainly because better models don’t always have more parameters. They usually do, but it’s not an ironclad rule.)
LLMs Focus On Language.
ChatGPT and its cousins take text as input and produce text as output. This makes them distinct from some of the image-generation tools that are on the market today, such as DALL-E and Midjourney.
It’s worth noting, however, that this might be changing in the future. Though most of what people are using GPT-4 to do revolves around text, technically, the underlying model is multimodal. This means it can theoretically interact with image inputs as well. According to OpenAI’s documentation, support for this feature should arrive in the coming months.
How Are Large Language Models Trained?
Like all machine learning models, LLMs must be trained. We don’t actually know exactly how OpenAI trained the latest GPT models, as they’ve kept those details secret, but we can make some broad comments about how systems like these are generally trained.
Before we get into technical details, let’s frame the overall task that LLMs are trying to perform as a guessing game. Imagine that I start a sentence and leave out the last word, asking you to provide a guess as to how it ends.
Some of these would be fairly trivial; everyone knows that “[i]t was the best of times, it was the worst of _____,” ends with the word “times.” Others would be more ambiguous; “I stopped to pick a flower, and then continued walking down the ____,” could plausibly end with words like “road”, “street”, or “trail.”
For still others, there’d be an almost infinite number of possibilities; “He turned to face the ___,” could end with anything from “firehose” to “firing squad.”
But how is it that you’re able to generate these guesses? How do you know what a good ending to a natural-language sentence sounds like?
The answer is that you’ve been “training” for this task your entire life. You’ve been listening to sentences, reading and writing sentences, or thinking in sentences for most of your waking hours, and have therefore developed a sense of how they work.
The process of training an LLM differs in many specifics, but at a high level, it’s learning to do the same thing. A model like GPT-4 is fed gargantuan amounts of textual data from the internet or other sources, and it learns a statistical distribution that allows it to predict which words come next.
At first, it’ll have no idea how to end the sentence “[i]t was the best of times, it was the worst of ____.” But as it sees more and more examples of human-generated textual content, it improves. It discovers that when someone writes “red, orange, yellow, green, blue, indigo, ______”, the next sequence of letters is probably “violet”. It begins to be more sensitive to context, discovering that the words “bat”, “diamond”, and “plate” are probably occurring in a discussion about baseball and not the weirdest Costco you’ve ever been to.
It’s precisely this nuance that makes advanced LLMs suitable for applications such as customer service.
They’re not simply looking up pre-digested answers to questions, they’re learning a function big enough to account for the subtleties of a specific customer’s specific problem. They still don’t do this job perfectly, but they’ve made remarkable progress, which is why so many companies are looking at integrating them.
Getting into the GPT-weeds
The discussion so far is great for building a basic intuition for how LLMs are trained, but this is a deep dive, so let’s talk technical specifics.
Though we don’t know much about GPT-4, earlier models like GPT and GPT-2 have been studied in great detail. By understanding how they work, we can cultivate a better grasp of cutting-edge models.
When an LLM is trained, it’s fed a great deal of text data. It will grab samples from this data, and try to predict the next token in its sample. To make our earlier explanation easier to understand we implied that a token is a word, but that’s not quite right. A token can be a word, an individual letter, or “sub words”, i.e. small chunks of letters and spaces.
This process is known as “self-supervised learning” because the model can assess its own accuracy by checking its predicted next token against the actual next token in the dataset it’s training on.
At first, its accuracy is likely to be very bad. But as it trains its internal parameters (remember those?) are tuned with an optimizer such as stochastic gradient descent, and it gets better.
One of the crucial architectural building blocks of LLMs is the transformer.
A full discussion of transformers is well beyond the scope of this piece, but the most important thing to know is that transformers can use “attention” to model more complex relationships in language data.
For example: in a sentence like “the dog didn’t chase the cat because it was too tired”, every human knows that “it” refers to the dog and not the cat. Earlier approaches to building language models struggled with such connections in sentences that were longer than a few words, but using attention, transformers can handle them with ease.
In addition to this obvious advantage, transformers have found widespread use in deep learning applications such as language models because they’re easy to parallelize, meaning that training times can be reduced.
Building On Top Of Large Language Models
Out-of-the-box LLMs are pretty powerful, but it’s often necessary to tweak them for specific applications such as enterprise bots. There are a few ways of doing this, and we’re going to confine ourselves to two major approaches: fine-tuning and prompt engineering.
First up, it’s possible to fine-tune some of these models. Fine-tuning an LLM involves providing a training set and letting the model update its internal weights to perform better on a specific task.
Next, the emerging discipline of prompt engineering refers to the practice of systematically crafting the text fed to the model to get it to better approximate the behavior you want.
LLMs can be surprisingly sensitive to small changes in words, phrases, and context; the job of a prompt engineer, therefore, is to develop a feel for these sensitivities and construct prompts in a way that maximizes the performance of the LLM.
How Can Large Language Models Be Used In Business?
There is a new gold rush in applying AI to business use cases.
For starters, given how good they are at generating text, they’re being deployed to write email copy, blog posts, and social media content, to text or survey customers, and to summarize text.
LLMs are also being used in software development. Tools like Replit’s Ghostwriter are already dramatically improving developer productivity in a variety of domains, from web development to machine learning.
What Are The “LLiMitations” Of LLMs?
For all their power, LLMs have turned out to have certain well-known limitations. To begin with, LLMs are capable of being toxic, harmful, aggressive, and biased.
Though heroic efforts have been made to train this behavior out with techniques such as reinforcement learning from human feedback, it’s possible that it can reemerge under the right conditions.
This is something you should take into account before giving customers access to generative AI offerings.
Another oft-discussed limitation is the tendency of LLMs to “invent” facts. Remember, an LLM is just trying to predict sequences of tokens, and there’s no reason it couldn’t output a sequence of text like “Dr. Micha Sartorius, professor of applied computronics at Santa Grega University”, even though this person, field, and university are fictitious.
This, too, is something you should be cognizant of before letting customers interact with generative AI.
At Quiq, we harness the power of LLMs’ language-generating capabilities, while putting strict guardrails in place to prevent these risks that are inherent to public-facing generative AI.
Should You Be Using Large Language Models?
LLMs are a remarkable engineering achievement, having been trained on vast amounts of human text and able to generate whole conversations, working code, and more.
No doubt, some of the fervor around LLMs will end up being hype. Nevertheless, the technology has been shown to be incredibly powerful, and it is unlikely to go anywhere. If you’re interested in learning about how to integrate generative AI applications like Quiq’s into your business, schedule a demo with us today!