Exploring Cutting-Edge Research in Large Language Models and Generative AI

By the calendar, ChatGPT was released just a few months ago. But subjectively, it feels as though 600 years have passed since we all read “as a large language model…” for the first time.

The pace of new innovations is staggering, but we at Quiq like to help our audience in the customer experience and contact center industries stay ahead of the curve (even when that requires faster-than-light travel).

Today, we will look at what’s new in generative AI, and what will be coming down the line in the months ahead.

Where will Generative AI be applied?

First, let’s start with industries that will be strongly impacted by generative AI. As we noted in an earlier article, training a large language model (LLM) like ChatGPT mostly boils down to showing it tons of examples of text until it learns a statistical representation of human language well enough to generate sonnets, email copy, and many other linguistic artifacts.

There’s no reason the same basic process (have it learn it from many examples and then create its own) couldn’t be used elsewhere, and in the next few sections, we’re going to look at how generative AI is being used in a variety of different industries to brainstorm structures, new materials, and a billion other things.

Generative AI in Building and Product Design

If you’ve had a chance to play around with DALL-E, Midjourney, or Stable Diffusion, you know that the results can be simply remarkable.

It’s not a far leap to imagine that it might be useful for quickly generating ideas for buildings and products.

The emerging field of AI-generated product design is doing exactly this. With generative image models, designers can use text prompts to rough out ideas and see them brought to life. This allows for faster iteration and quicker turnaround, especially given that creating a proof of concept is one of the slower, more tedious parts of product design.

Image source: Board of Innovation

 

For the same reason, these tools are finding use among architects who are able to quickly transpose between different periods and styles, see how better lighting impacts a room’s aesthetic, and plan around themes like building with eco-friendly materials.

There are two things worth pointing out about this process. First, there’s often a learning curve because it can take a while to figure out prompt engineering well enough to get a compelling image. Second, there’s a hearty dose of serendipity. Often the resulting image will not be quite what the designer had in mind, but it’ll be different in new and productive ways, pushing the artist along fresh trajectories that might never have occurred to them otherwise.

Generative AI in Discovering New Materials

To quote one of America’s most renowned philosophers (Madonna), we’re living in a material world. Humans have been augmenting their surroundings since we first started chipping flint axes back in the Stone Age; today, the field of materials science continues the long tradition of finding new stuff that expands our capabilities and makes our lives better.

This can take the form of something (relatively) simple like researching a better steel alloy, or something incredibly novel like designing a programmable nanomaterial.

There’s just one issue: it’s really, really difficult to do this. It takes a great deal of time, energy, and effort to even identify plausible new materials, to say nothing of the extensive testing and experimenting that must then follow.

Materials scientists have been using machine learning (ML) in their process for some time, but the recent boom in generative AI is driving renewed interest. There are now a number of projects aimed at e.g. using variational autoencoders, recurrent neural networks, and generative adversarial networks to learn a mapping between information about a material’s underlying structure and its final properties, then using this information to create plausible new materials.

It would be hard to overstate how important the use of generative AI in materials science could be. If you imagine the space of possible molecules as being like its own universe, we’ve explored basically none of it. What new fabrics, medicines, fuels, fertilizers, conductors, insulators, and chemicals are waiting out there? With generative AI, we’ve got a better chance than ever of finding out.

Generative AI in Gaming

Gaming is often an obvious place to use new technology, and that’s true for generative AI as well. The principles of generative design we discussed two sections ago could be used in this context to flesh out worlds, costumes, weapons, and more, but it can also be used to make character interactions more dynamic.

From Navi trying to get our attention in Ocarina of Time to GlaDOS’s continual reminders that “the cake is a lie” in Portal, non-playable characters (NPCs) have always added texture and context to our favorite games.

Powered by LLMs, these characters may soon be able to have open-ended conversations with players, adding more immersive realism to the gameplay. Rather than pulling from a limited set of responses, they’d be able to query LLMs to provide advice, answer questions, and shoot the breeze.

What’s Next in Generative AI?

As impressive as technologies like ChatGPT are, people are already looking for ways to extend their capabilities. Now that we’ve covered some of the major applications of generative AI, let’s look at some of the exciting applications people are building on top of it.

What is AutoGPT and how Does it Work?

ChatGPT can already do things like generate API calls and build simple apps, but as long as a human has to actually copy and paste the code somewhere useful, its capacities are limited.

But what if that weren’t an issue? What if it were possible to spin ChatGPT up into something more like an agent, capable of semi-autonomously interacting with software or online services to complete strings of tasks?

This is exactly what Auto-GPT is intended to accomplish. Auto-GPT is an application built by developer Toran Bruce Richards, and it is comprised of two parts: an LLM (either GPT-3.5 or GPT-4), and a separate “bot” that works with the LLM.

By repeatedly querying the LLM, the bot is able to take a relatively high-level task like “help me set up an online business with a blog and a website” or “find me all the latest research on quantum computing”, decompose it into discrete, achievable steps, then iteratively execute them until the overall objective is achieved.

At present, Auto-GPT remains fairly primitive. Just as ChatGPT can get stuck in repetitive and unhelpful loops, so too can Auto-GPT. Still, it’s a remarkable advance, and it’s spawned a series of other projects attempting to do the same thing in a more consistent way.

The creators of AssistGPT bill it as a “General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn”. It handles multi-modal tasks (i.e. tasks that rely on vision or sound and not just text) better than Auto-GPT, and by integrating with a suite of tools it is able to achieve objectives that involve many intermediate steps and sub-tasks.

SuperAGI, in turn, is just as ambitious. It’s a platform that offers a way to quickly create, deploy, manage, and update autonomous agents. You can integrate them into applications like Slack or vector databases, and it’ll even ping you if an agent gets stuck somewhere and starts looping unproductively.

Finally, there’s LangChain, which is a similar idea. LangChain is a framework that is geared towards making it easier to build on top of LLMs. It features a set of primitives that can be stitched into more robust functionality (not unlike “for” and “while” loops in programming languages), and it’s even possible to build your own version of AutoGPT using LangChain.

What is Chain-of-Thought Prompting and How Does it Work?

In the misty, forgotten past (i.e. 5 months ago), LLMs were famously bad at simple arithmetic. They might be able to construct elegant mathematical proofs, but if you asked them what 7 + 4 is, there was a decent chance they’d get it wrong.

Chain-of-thought (COT) prompting refers to a few-shot learning method of eliciting output from an LLM that compels it to reason in a step-by-step way, and it was developed in part to help with this issue. This image from the original Wei et al. (2022) paper illustrates how:

Input and output examples for Standard and Chain-of-thought Prompting.
Source: ARXIV.org

As you can see, the model’s performance is improved because it’s being shown a chain of different thoughts, hence chain-of-thought.

This technique isn’t just useful for arithmetic, it can be utilized to get better output from a model in a variety of different tasks, including commonsense and symbolic reasoning.

In a way, humans can be prompt engineered in the same fashion. You can often get better answers out of yourself or others through a deliberate attempt to reason slowly, step-by-step, so it’s not a terrible shock that a large model trained on human text would benefit from the same procedure.

The Ecosystem Around Generative AI

Though cutting-edge models are usually the stars of the show, the truth is advanced technologies aren’t worth much if you have to be deeply into the weeds to use them. Machine learning, for example, would surely be much less prevalent if tools like sklearn, Tensorflow, and Keras didn’t exist.

Though we’re still in the early days of LLMs, AutoGPT, and everything else we’ve discussed, we suspect the same basic dynamic will play out. Since it’s now clear that these models aren’t toys, people will begin building infrastructure around them that streamlines the process of training them for specific use cases, integrating them into existing applications, etc.

Let’s discuss a few efforts in this direction that are already underway.

Training and Education

Among the simplest parts of the emerging generative AI value chain is exactly what we’re doing now: talking about it in an informed way. Non-specialists will often lack the time, context, and patience required to sort the real breakthroughs from the hype, so putting together blog posts, tutorials, and reports that make this easier is a real service.

Making Foundation Models Available

“Foundation models” is a new term that refers to the actual algorithms that underlie LLMs. ChatGPT, for example, is not a foundation model. GPT-4 is the foundation model, and ChatGPT is a specialized application of it (more on this shortly).

Companies like Anthropic, Google, and OpenAI can train these gargantuan models and then make them available through an API. From there, developers are able to access their preferred foundation model over an API.

This means that we can move quickly to utilize their remarkable functionality, which wouldn’t be the case if every company had to train their own from scratch.

Building Applications Around Specific Use Cases

One of the most striking properties of ChatGPT is how amazingly general they are. They are capable of “…generating functioning web apps with just a few prompts, writing Spanish-language children’s stories about the blockchain in the style of Dr. Suess, [and] opining on the virtues and vices of major political figures”, to name but a few examples.

General-purpose models often have to be fine-tuned to perform better on a specific task, especially if they’re doing something tricky like summarizing medical documents with lots of obscure vocabulary. Alas, there is a tradeoff here, because in most cases these fine-tuned models will afterward not be as useful for generic tasks.

The issue, however, is that you need a fair bit of technical skill to set up a fine-tuning pipeline, and you need a fair bit of elbow grease to assemble the few hundred examples a model needs in order to be fine-tuned. Though this is much simpler than training a model in the first place it is still far from trivial, and we expect that there will soon be services aimed at making it much more straightforward.

LLMOps and Model Hubs

We’d venture to guess you’ve heard of machine learning, but you might not be familiar with the term “MLOps”. “Ops” means “operations”, and it refers to all the things you have to do to use a machine learning model besides just training it. Once a model has been trained it has to be monitored, for example, because sometimes its performance will begin to inexplicably degrade.

The same will be true of LLMs. You’ll need to make sure that the chatbot you’ve deployed hasn’t begun abusing customers and damaging your brand, or that the deep learning tool you’re using to explore new materials hasn’t begun to spit out gibberish.

Another phenomenon from machine learning we think will be echoed in LLMs is the existence of “model hubs”, which are places where you can find pre-trained or fine-tuned models to use. There certainly are carefully guarded secrets among technologists, but on the whole, we’re a community that believes in sharing. The same ethos that powers the open-source movement will be found among the teams building LLMs, and indeed there are already open-sourced alternatives to ChatGPT that are highly performant.

Looking Ahead

As they’re so fond of saying on Twitter, “ChatGPT is just the tip of the iceberg.” It’s already begun transforming contact centers, boosting productivity among lower-skilled workers while reducing employee turnover, but research into even better tools is screaming ahead.

Frankly, it can be enough to make your head spin. If LLMs and generative AI are things you want to incorporate into your own product offering, you can skip the heady technical stuff and skip straight to letting Quiq do it for you. The Quiq conversational AI platform is a best-in-class product suite that makes it much easier to utilize these technologies. Schedule a demo to see how we can help you get in on the AI revolution.

How to Evaluate Generated Text and Model Performance

Machine learning is an incredibly powerful technology. That’s why it’s being used in everything from autonomous vehicles to medical diagnoses to the sophisticated, dynamic AI Assistants that are handling customer interactions in modern contact centers.

But for all this, it isn’t magic. The engineers who build these systems must know a great deal about how to evaluate them. How do you know when a model is performing as expected, or when it has begun to overfit the data? How can you tell when one model is better than another?

This subject will be our focus today. We’ll cover the basics of evaluating a machine learning model with metrics like mean squared error and accuracy, then turn our attention to the more specialized task of evaluating the generated text of a large language model like ChatGPT.

How to Measure the Performance of a Machine Learning Model?

A machine learning model is always aimed at some task. It might be trying to fit a regression line that helps predict the future price of Bitcoin, it might be clustering documents according to their topics, or it might be trying to generate text so good it rivals that produced by humans.

How does the model know when it’s gotten the optimal line or discovered the best way to cluster documents? (And more importantly, how do you know?)

In the next few sections, we’ll talk about a few common ways of evaluating the performance of a machine-learning model. If you’re an engineer this will help you create better models yourself, and if you’re a layperson, it’ll help you better understand how the machine-learning pipeline works.

Evaluation Metrics for Regression Models

Regression is one of the two big types of basic machine learning, with the other being classification.

In tech-speak, we say that the purpose of a regression model is to learn a function that maps a set of input features to a real value (where “real” just means “real numbers”). This is not as scary as it sounds; you might try to create a regression model that predicts the number of sales you can expect given that you’ve spent a certain amount on advertising, or you might try to predict how long a person will live on the basis of their daily exercise, water intake, and diet.

In each case, you’ve got a set of input features (advertising spend or daily habits), and you’re trying to predict a target variable (sales, life expectancy).

The relationship between the two is captured by a model, and a model’s quality is evaluated with a metric. Popular metrics for regression models include the mean squared error, the root mean squared error, and the mean absolute error (though there are plenty of others if you feel like going down a nerdy rabbit hole).

The mean squared error (MSE) quantifies how good a regression model is by calculating the difference between the line and each real data point, squaring them (so that positive and negative differences don’t cancel out), and then averaging them. This gives a single number that the training algorithm can use to adjust its model – if the MSE is going down, the model is getting better, if it’s going up, it’s getting worse.

The root mean squared error (RMSE) does the exact same thing, but the final step is that you take the square root of the MSE. The big advantage here is that it converts the units of your metric back into the units you’re using in your problem (i.e. the “squared dollars” of MSE become “dollars” again, which makes it easier to think about what’s going on).

The mean absolute error (MAE) is the same basic idea, but it uses absolute values instead of squares. This also has the advantage of not penalizing outliers as much as the RMSE does. If you’ve got some outlier data point that’s far away from your model, squaring the difference will result in a bigger error than simply taking the absolute value of that difference. For this reason, it’s less sensitive to outliers in the dataset.

Evaluation Metrics for Classification Models

People tend to struggle less with understanding classification models because it’s more intuitive: you’re building something that can take a data point (the price of an item) and sort it into one of a number of different categories (i.e. “cheap”, “somewhat expensive”, “expensive”, “very expensive”).

Of course, the categories you choose will depend on the problem you’re trying to solve and the domain you’re operating in – a $100 apple is certainly “very expensive”, but a $100 dollar wedding ring…will probably get you left at the altar.

Regardless, it’s just as essential to evaluate the performance of a classification model as it is to evaluate the performance of a regression model. Some common evaluation metrics for classification models are accuracy, precision, and recall.

Accuracy is simple, and it’s exactly what it sounds like. You find the accuracy of a classification model by dividing the number of correct predictions it made by the total number of predictions it made altogether. If your classification model made 1,000 predictions and got 941 of them right, that’s an accuracy rate of 94.1% (not bad!)

Both precision and recall are subtler variants of this same idea. The precision is the number of true positives (correct classifications) divided by the sum of true positives and false positives (incorrect positive classifications). It says, in effect, “When your model thought it had identified a needle in a haystack, this is how often it was correct.”

The recall is the number of true positives divided by the sum of true positives and false negatives (incorrect negative classifications). It says, in effect “There were 200 needles in this haystack, and your model found 72% of them.”

Accuracy tells you how well your model performed overall, precision tells you how confident you can be in its positive classifications, and recall tells you how often it found the positive classifications.

(You may be wondering if this isn’t overkill. Do we really need all these different ratios? Answering that question fully would take us too far from our purpose of measuring the quality of text from generative AI models, but suffice it to say that there are trade-offs involved. Sometimes it makes more sense to focus on boosting the precision, other times getting a higher recall is more important. These are all just different tools for figuring out how to spend your limited time and energy to get a model that best solves your problem.)

Contact Us

How Can I Assess the Performance of a Generative AI Model?

Now, we arrive at the center of this article. Everything up to now has been background context that hopefully has given you a feel for how models are evaluated, because from here on out it’s a bit more abstract.

Using Reference Text for Evaluating Generative Models

When we wanted to evaluate a regression model, we started by looking at how far its predictions were from actual data points.

Well, we do essentially the same thing with generative language models. To assess the quality of text generated by a model, we’ll compare it against high-quality text that’s been selected by domain experts.

The Bilingual Evaluation Understudy (BLEU) Score

The BLEU score can be used to actually quantify the distance between the generated and reference text. It does this by comparing the amount of overlap in the n-grams [1] between the two using a series of weighted precision scores.

The BLEU score varies from 0 to 1. A score of “0” indicates that there is no n-gram overlap between the generated and reference text, and the model’s output is considered to be of low quality. A score of “1”, conversely, indicates that there is total overlap between the generated and reference text, and the model’s output is considered to be of high quality.

Comparing BLEU scores across different sets of reference texts or different natural languages is so tricky that it’s considered best to avoid it altogether.

Also, be aware that the BLEU score contains a “brevity penalty” which discourages the model from being too concise. If the model’s output is too much shorter than the reference text, this counts as a strike against it.

The Recall-Oriented Understudy for Gisting Evaluation (ROGUE) Score

Like the BLEU score, the ROGUE score is examining the n-gram overlap between an output text and a reference text. Unlike the BLEU score, however, it uses recall instead of precision.

There are three types of ROGUE scores:

  1. rogue-n: Rogue-n is the most common type of ROGUE score, and it simply looks at n-gram overlap, as described above.
  2. rogue-l: Rogue-l looks at the “Longest Common Subsequence” (LCS), or the longest chain of tokens that the reference and output text share. The longer the LCS, of course, the more the two have in common.
  3. rogue-s: This is the least commonly-used variant of the ROGUE score, but it’s worth hearing about. Rogue-s concentrates on the “skip-grams” [2] that the two texts have in common. Rogue-s would count “He bought the house” and “He bought the blue house” as overlapping because they have the same words in the same order, despite the fact that the second sentence does have an additional adjective.

The Metric for Evaluation of Translation with Explicit Ordering (METEOR) Score

The METEOR Score takes the harmonic mean of the precision and recall scores for 1-gram overlap between the output and reference text. It puts more weight on recall than on precision, and it’s intended to address some of the deficiencies of the BLEU and ROGUE scores while maintaining a pretty close match to how expert humans assess the quality of model-generated output.

BERT Score

At this point, it may have occurred to you to wonder whether the BLEU and ROGUE scores are actually doing a good job of evaluating the performance of a generative language model. They look at exact n-gram overlaps, and most of the time, we don’t really care that the model’s output is exactly the same as the reference text – it needs to be at least as good, without having to be the same.

The BERT score is meant to address this concern through contextual embeddings. By looking at the embeddings behind the sentences and comparing those, the BERT score is able to see that “He quickly ate the treats” and “He rapidly consumed the goodies” are expressing basically the same idea, while both the BLEU and ROGUE scores would completely miss this.

Final thoughts.

We’ve all seen what generative AI can do, and it’s fair at this point to assume this technology is going to become more prevalent in fields like software engineering, customer service, customer experience, and marketing.

But, as magical as generative AI might seem to be, they’re just models. They have to be evaluated and monitored just like any other, or you risk having a bad one negatively impact your brand.

If you’re enchanted by the potential of using generative algorithms in your contact center but are daunted by the challenge of putting together an engineering team, reach out to us for a demo of the Quiq conversational CX platform. We can help you put this cutting-edge technology to work without having to worry about all the finer details and resourcing issues.

***

Footnotes

[1] An n-gram is just a sequence of characters, words, or entire sentences. A 1-gram is usually single words, a 2-gram is usually two words, etc.
[2] Skip-grams are a rather involved subdomain of natural language processing. You can read more about them in this article, but frankly, most of it is irrelevant to this article. All you need to know is that the rogue-s score is set up to be less concerned with exact n-gram overlaps than the alternatives.

How to Get the Most out of Your NLP Models with Preprocessing

Along with computer vision, natural language processing (NLP) is one of the great triumphs of modern machine learning. While ChatGPT is all the rage and large language models (LLMs) are drawing everyone’s attention, that doesn’t mean that the rest of the NLP field just goes away.

NLP endeavors to apply computation to human-generated language, whether that be the spoken word or text existing in places like Wikipedia. There are any number of ways in which this would be relevant to customer experience and service leaders, including:

Today, we’re going to briefly touch on what NLP is, but we’ll spend the bulk of our time discussing how textual training data can be preprocessed to get the most out of an NLP system. There are a few branches of NLP, like speech synthesis and text-to-speech, which we’ll be omitting.

Armed with this context, you’ll be better prepared to evaluate using NLP in your business (though if you’re building customer-facing chatbots, you can also let the Quiq platform do the heavy lifting for you).

What is Natural Language Processing?

In the past, we’ve jokingly referred to NLP as “doing computer stuff with words after you’ve tricked them into being math.” This is meant to be humorous, but it does capture the basic essence.

Remember, your computer doesn’t know what words are, all it does is move 1’s and 0’s around. A crucial step in most NLP applications, therefore, is creating a numerical representation out of the words in your training corpus.

There are many ways of doing this, but today a popular method is using word vector embeddings. Also known simply as “embeddings”, these are vectors of real numbers. They come from a neural network or a statistical algorithm like word2vec and stand in for particular words.

The technical details of this process don’t concern us in this post, what’s important is that you end up with vectors that capture a remarkable amount of semantic information. Words with similar meanings also have similar vectors, for example, so you can do things like find synonyms for a word by finding vectors that are mathematically close to it.

These embeddings are the basic data structures used across most of NLP. They power sentiment analysis, topic modeling, and many other applications.

For most projects it’s enough to use pre-existing word vector embeddings without going through the trouble of generating them yourself.

Are Large Language Models Natural Language Processing?

Large language models (LLMs) are a subset of natural language processing. Training an LLM draws on many of the same techniques and best practices as the rest of NLP, but NLP also addresses a wide variety of other language-based tasks.

Conversational AI is a great case in point. One way of building a conversational agent is by hooking your application up to an LLM like ChatGPT, but you can also do it with a rules-based approach, through grounded learning, or with an ensemble that weaves together several methods.

Getting the Most out of Your NLP Models with Preprocessing

Data Preprocessing for NLP

If you’ve ever sent a well-meaning text that was misinterpreted, you know that language is messy. For this reason, NLP places special demands on the data engineers and data scientists who must transform text in various ways before machine learning algorithms can be trained on it.

In the next few sections, we’ll offer a fairly comprehensive overview of data preprocessing for NLP. This will not cover everything you might encounter in the course of preparing data for your NLP application, but it should be more than enough to get started.

Why is Data Preprocessing Important?

They say that data is the new oil, and just as you can’t put oil directly in your gas tank and expect your car to run, you can’t plow a bunch of garbled, poorly-formatted language data into your algorithms and expect magic to come out the other side.

But what, precisely, counts as preprocessing will depend on your goals. You might choose to omit or include emojis, for example, depending on whether you’re training a model to summarize academic papers or write tweets for you.

That having been said, there are certain steps you can almost always expect to take, including standardizing the case of your language data, removing punctuation, white spaces and stop words, segmenting and tokenizing, etc.

We treat each of these common techniques below.

Segmentation and Tokenization

An NLP model is always trained on some consistent chunk of the full data. When ChatGPT was trained, for example, they didn’t put the entire internet in a big truck and back it up to a server farm, they used self-supervised learning.

Simplifying greatly, this means that the underlying algorithm would take, say, the first few three sentences of a paragraph and then try to predict the remaining sentence on the basis of the text that came before. Over time it sees enough language to guess that “to be or not to be, that is ___ ________” ends with “the question.”

But how was ChatGPT shown the first three sentences? How does that process even work?

A big part of the answer is segmentation and tokenization.

With segmentation, we’re breaking a full corpus of training text – which might contain hundreds of books and millions of words – down into units like words or sentences.

This is far from trivial. In English, sentences end with a period, but words like “Mr.” and “etc.” also contain them. It can be a real challenge to divide text into sentences without also breaking “Mr. Smith is cooking the steak.” into “Mr.” and “Smith is cooking the steak.”

Tokenization is a related process of breaking a corpus down into tokens. Tokens are sometimes described as words, but in truth they can be words, short clusters of a few words, sub-words, or even individual characters.

This matters a lot to the training of your NLP model. You could train a generative language model to predict the next sentence based on the preceding sentences, the next word based on the preceding words, or the next character based on the preceding characters.

Regardless, in both segmentation and tokenization, you’re decomposing a whole bunch of text down into individual units that your algorithm can work with.

Making the Case Consistent

It’s standard practice to make the case of your text consistent throughout, as this makes training simpler. This is usually done by lowercasing all the text, though we suppose if you’re feeling rebellious there’s no reason you couldn’t uppercase it (but the NLP engineers might not invite you to their fun Natural Language Parties if you do.)

Fixing Misspellings

NLP, like machine learning more generally, is only as good as its data. If you feed it text with a lot of errors in spelling, it will learn those errors and they’ll show up again later.

This probably isn’t something you’ll want to do manually, and if you’re using a popular language there’s likely a module you can use to do this for you. Python, for example, has TextBlob, Autocorrect, and Pyspellchecker libraries that can handle spelling errors.

Getting Rid of the Punctuation Marks

Natural language tends to have a lot of punctuation, with English utilizing dozens of marks such as ‘!’ and ‘;’ for emphasis and clarification. These are usually removed as part of preprocessing.

This task is something that can be handled with regular expressions (if you have the patience for it…), or you can do it with an NLP library like Natural Language Toolkit (NLTK).

Expanding the Contractions

Contractions are shortened versions of words, like turning “do not” into “don’t” or “would not” into “wouldn’t”. These, too, can be problematic for NLP algorithms and are usually removed during preprocessing.

Stemming

In linguistics, the stem of a word is its root. The words “runs”, “ran”, and “running” all have the word “run” as their base.

Stemming is one of two approaches for reducing the myriad tenses of a word down into a single basic representation. The other is lemmatization, which we’ll discuss in the next section.

Stemming is the cruder of the two, and is usually done with an algorithm known as Porter’s Stemmer. This stemmer doesn’t always produce the stem you’d expect. “Cats” becomes “cat” while “ponies” becomes “poni”, for example. Nevertheless, this is probably sufficient for basic NLP tasks.

Lemmatization

A more sophisticated version of stemming is lemmatization. A stemmer wouldn’t know the difference between the word “left” in “cookies are ahead and to the left” and “he left the book on the table”, whereas a lemmatizer would.

More generally, a lemmatizer uses language-specific context to handle very subtle distinctions between words, and this means it will usually take longer to run than a stemmer.

Whether it makes sense to use a stemmer or a lemmatizer will depend on the use case you’re interested in. Under most circumstances, lemmatizers are more accurate, and stemmers are faster.

Removing Extra White Spaces

It’ll often be the case that a corpus will have an inconsistent set of spacing conventions. This, too, is something algorithm will learn unless it’s remedied during preprocessing.
Removing Stopwords

This is a big one. “Stopwords” are words like “the” or “is” are all stopwords, and they’re almost always removed before training begins because they don’t add much in the way of useful information.

Because this is done so commonly, you can assume that the NLP library you’re using will have some easy way of doing it. NLTK, for example, has a native list of stopwords that can simply be imported:

from nltk.corpus import stopwords

With this, you can simply exclude the stopwords from the corpus.

Ditching the Digits

If you’re building an NLP application that processes data containing numbers, you’ll probably want to remove that as the training algorithm might end up inserting random digits here and there.

This, alas, is something that will probably need to be done with regular expressions.

Part of Speech Tagging

Part of speech tagging refers to the process of automatically tagging a word with extra grammatical information about whether it’s a noun, verb, etc.

This is certainly not something that you always have to do (we’ve completed a number of NLP projects where it never came up), but it’s still worth understanding what it is.

Supercharging Your NLP Applications

Natural language processing is an enormously powerful constellation of techniques that allow computers to do worthwhile work on text data. It can be used to build question-answering systems, tutors, chatbots, and much more.

But to get the most out of it, you’ll need to preprocess the data. No matter how much computing you have access to, machine learning isn’t of much use with bad data. Techniques like removing stopwords, expanding contractions, and lemmatization create corpora of text that can then be fed to NLP algorithms.

Of course, there’s always an easier way. If you’d rather skip straight to the part where cutting-edge conversational AI directly adds value to your business, you can also reach out to see what the Quiq platform can do.

What Is Transfer Learning? – The Role of Transfer Learning in Building Powerful Generative AI Models

Machine learning is hard work. Sure, it only takes a few minutes to knock out a simple tutorial where you’re training an image classifier on the famous iris dataset, but training a big model to do something truly valuable – like interacting with customers over a chat interface – is a much greater challenge.

Transfer learning offers one possible solution to this problem. By making it possible to train a model in one domain and reuse it in another, transfer learning can reduce demands on your engineering team by a substantial amount.

Today, we’re going to get into transfer learning, defining what it is, how it works, where it can be applied, and the advantages it offers.

Let’s get going!

What is Transfer Learning in AI?

In the abstract, transfer learning refers to any situation in which knowledge from one task, problem, or domain is transferred to another. If you learn how to play the guitar well and then successfully use those same skills to pick up a mandolin, that’s an example of transfer learning.

Speaking specifically about machine learning and artificial intelligence, the idea is very similar. Transfer learning is when you pre-train a model on one task or dataset and then figure out a way to reuse it for another (we’ll talk about methods later).

If you train an image model, for example, it will tend to learn certain low-level features (like curves, edges, and lines) that show up in pretty much all images. This means you could fine-tune the pre-trained model to do something more specialized, like recognizing faces.

Why Transfer Learning is Important in Deep Learning Models

Building a deep neural network requires serious expertise, especially if you’re doing something truly novel or untried.

Transfer learning, while far from trivial, is simply not as taxing. GPT-4 is the kind of project that could only have been tackled by some of Earth’s best engineers, but setting up a fine-tuning pipeline to get it to do good sentiment analysis is a much simpler job.

By lowering the barrier to entry, transfer learning brings advanced AI into reach for a much broader swath of people. For this reason alone, it’s an important development.

Transfer Learning vs. Fine-Tuning

And speaking of fine-tuning, it’s natural to wonder how it’s different from transfer learning.

The simple answer is that fine-tuning is a kind of transfer learning. Transfer learning is a broader concept, and there are other ways to approach it besides fine-tuning.

What are the 5 Types of Transfer Learning?

Broadly speaking, there are five major types of transfer learning, which we’ll discuss in the following sections.

Domain Adaptation

Under the hood, most modern machine learning is really just an application of statistics to particular datasets.

The distribution of the data a particular model sees, therefore, matters a lot. Domain adaptation refers to a family of transfer learning techniques in which a model is (hopefully) trained such that it’s able to handle a shift in distributions from one domain to another (see section 5 of this paper for more technical details).

Domain Confusion

Earlier, we referenced the fact that the layers of a neural network can learn representations of particular features – one layer might be good at detecting curves in images, for example.

It’s possible to structure our training such that a model learns more domain invariant features, i.e. features that are likely to show up across multiple domains of interest. This is known as domain confusion because, in effect, we’re making the domains as similar as possible.

Multitask Learning

Multitask learning is arguably not even a type of transfer learning, but it came up repeatedly in our research, so we’re adding a section about it here.

Multitask learning is what it sounds like; rather than simply training a model on a single task (i.e. detecting humans in images), you attempt to train it to do several things at once.

The debate about whether multitask learning is really transfer learning stems from the fact that transfer learning generally revolves around adapting a pre-trained model to a new task, rather than having it learn to do more than one thing at a time.

One-Shot Learning

One thing that distinguishes machine learning from human learning is that the former requires much more data. A human child will probably only need to see two or three apples before they learn to tell apples from oranges, but an ML model might need to see thousands of examples of each.

But what if that weren’t necessary? The field of one-shot learning addresses itself to the task of learning e.g. object categories from either one example or a small number of them. This idea was pioneered in “One-Shot Learning of Object Categories”, a watershed paper co-authored by Fei-Fei Li and her collaborators. Their Bayesian one-shot learner was able to “…to incorporate prior knowledge of the object world into the learning scheme”, and it outperformed a variety of other models in object recognition tasks.

Zero-Shot Learning

Of course, there might be other tasks (like translating a rare or endangered language), for which it is effectively impossible to have any labeled data for a model to train on. In such a case, you’d want to use zero-shot learning, which is a type of transfer learning.

With zero-shot learning, the basic idea is to learn features in one data set (like images of cats) that allow successful performance on a different data set (like images of dogs). Humans have little problem with this, because we’re able to rapidly learn similarities between types of entities. We can see that dogs and cats both have tails, both have fur, etc. Machines can perform the same feat if the data is structured correctly.

How Does Transfer Learning Work?

There are a few different ways you can go about utilizing transfer learning processes in your own projects.

Perhaps the most basic is to use a good pre-trained model off the shelf as a feature extractor. This would mean keeping the pre-trained model in place, but then replacing its final layer with a layer custom-built for your purposes. You could take the famous AlexNet image classifier, remove its last classification layer, and replace it with your own, for example.

Or, you could fine-tune the pre-trained model instead. This is a more involved engineering task and requires that the pre-trained model be modified internally to be better suited to a narrower application. This will often mean that you have to freeze certain layers in your model so that the weights don’t change, while simultaneously allowing the weights in other layers to change.

What are the Applications of Transfer Learning?

As machine learning and deep learning have grown in importance, so too has transfer learning become more crucial. It now shows up in a variety of different industries. The following are some high-level indications of where you might see transfer learning being applied.

Speech recognition across languages: Teaching machines to recognize and process spoken language is an important area of AI research and will be of special interest to those who operate contact centers. Transfer learning can be used to take a model trained in a language like French and repurpose it for Spanish.

Training general-purpose game engines: If you’ve spent any time playing games like chess or go, you know that they’re fairly different. But, at a high enough level of abstraction, they still share many features in common. That’s why transfer learning can be used to train up a model on one game and, under certain conditions, use it in another.

Object recognition and segmentation: Our Jetsons-like future will take a lot longer to get here if our robots can’t learn to distinguish between basic objects. This is why object recognition and object segmentation are both such important areas of research. Transfer learning is one way of speeding up this process. If models can learn to recognize dogs and then quickly be re-purposed for recognizing muffins, then we’ll soon be able to outsource both pet care and cooking breakfast.

transfer_learning_chihuahua
In fairness to the AI, it’s not like we can really tell them apart!

Applying Natural Language Processing: For a long time, computer vision was the major use case of high-end, high-performance AI. But with the release of ChatGPT and other large language models, NLP has taken center stage. Because much of the modern NLP pipeline involves word vector embeddings, it’s often possible to use a baseline, pre-trained NLP model in applications like topic modeling, document classification, or spicing up your chatbot so it doesn’t sound so much like a machine.

What are the Benefits of Transfer Learning?

Transfer learning has become so popular precisely because it offers so many advantages.

For one thing, it can dramatically reduce the amount of time it takes to train a new model. Because you’re using a pre-trained model as the foundation for a new, task-specific model, far fewer engineering hours have to be spent to get good results.

There are also a variety of situations in which transfer learning can actually improve performance. If you’re using a good pre-trained model that was trained on a general enough dataset, many of the features it learned will carry over to the new task.

This is especially true if you’re working in a domain where there is relatively little data to work with. It might simply not be possible to train a big, cutting-edge model on a limited dataset, but it will often be possible to use a pre-trained model that is fine-tuned on that limited dataset.

What’s more, transfer learning can work to prevent the ever-present problem of overfitting. Overfitting has several definitions depending on what resource you consult, but a common way of thinking about it is when the model is complex enough relative to the data that it begins learning noise instead of just signal.

That means that it may do spectacularly well in training only to generalize poorly when it’s shown fresh data. Transfer learning doesn’t completely rule out this possibility, but it makes it less likely to happen.

Transfer learning also has the advantage of being quite flexible. You can use transfer learning for everything from computer vision to natural language processing, and many domains besides.

Relatedly, transfer learning makes it possible for your model to expand into new frontiers. When done correctly, a pre-trained model can be deployed to solve an entirely new problem, even when the underlying data is very different from what it was shown before.

When To Use Transfer Learning

The list of benefits we just enumerated also offers a clue as to when it makes sense to use transfer learning.

Basically, you should consider using transfer learning whenever you have limited data, limited computing resources, or limited engineering brain cycles you can throw at a problem. This will often wind up being the case, so whenever you’re setting your sights on a new goal, it can make sense to spend some time seeing if you can’t get there more quickly by simply using transfer learning instead of training a bespoke model from scratch.

Check out the second video in Quiq’s LLM Intuitions series—created by our Head of AI, Kyle McIntyre—to learn about one of the oldest forms of transfer learning: Word embeddings.

Transfer Learning and You

In the contact center space, we understand how difficult it can be to effectively apply new technologies to solve our problems. It’s one thing to put together a model for a school project, and quite another to have it tactfully respond to customers who might be frustrated or confused.

Transfer learning is one way that you can get more bang for your engineering buck. By training a model on one task or dataset and using it on another, you can reduce your technical budget while still getting great results.

You could also just rely on us to transfer our decades of learning on your behalf (see what we did there). We’ve built an industry-leading conversational AI chat platform that is changing the game in contact centers. Reach out today to see how Quiq can help you leverage the latest advances in AI, without the hassle.

How Generative AI is Supercharging Contact Center Agents

If you’re reading this, you’ve probably had a chance to play around with ChatGPT or one of the other large language models (LLMs) that have been making waves and headlines in recent months.

Concerns around automation go back a long way, but there’s long been extra worry about the possibility that machines will make human labor redundant. If you’ve used generative AI to draft blog posts or answer technical questions, it’s natural to wonder if perhaps algorithms will soon be poised to replace humans in places like contact centers.

Given how new these LLMs are there has been little scholarship on how they’ve changed the way contact centers function. But “Generative AI at Work” by Erik Brynjolfsson, Danielle Li, and Lindsey R. Raymond took aim at exactly this question.

The results are remarkable. They found that access to tools like ChatGPT not only led to a marked increase in productivity among the lowest-skilled workers, it also had positive impacts on other organizational metrics, like reducing turnover.

Today, we’re going to break this economic study down, examining its methods, its conclusions, and what they mean for the contact centers of the future.

Let’s dig in!

A Look At “Generative AI At Work”

The paper studies data from the use of a conversational AI assistant by a little over 5,000 agents working in customer support.

It contains several major sections, beginning with a technical primer on what generative AI is and how it works before moving on to a discussion of the study’s methods and results.

What is Generative AI?

Covering the technical fundamentals of generative AI will inform our efforts to understand the ways in which this AI technology affected work in the study, as well as how it is likely to do so in future deployments.

A good way to do this is to first grasp how traditional, rules-based programming works, then contrast this with generative AI.

When you write a computer program, you’re essentially creating a logical structure that furnishes instructions the computer can execute.

To take a simple case, you might try to reverse a string such as “Hello world”. One way to do this explicitly is to write code in a language like Python which essentially says:

“Create a new, empty list, then start at the end of the string we gave you and work forward, successively adding each character you encounter to that list before joining all the characters into a reversed string”:

Python code demonstrating a reverse string.

Despite the fact that these are fairly basic instructions, it’s possible to weave them into software that can steer satellites and run banking infrastructure.

But this approach is not suitable for every kind of problem. If you’re trying to programmatically identify pictures of roses, for example, it’s effectively impossible to do this with rules like the ones we used to reverse the string.

Machine learning, however, doesn’t even try to explicitly define any such rules. It works instead by feeding a model many pictures of roses, and “training” it to learn a function that lets it identify new pictures of roses it has never seen before.

Generative AI is a kind of machine learning in which gargantuan models are trained on mind-boggling amounts of text data until they’re able to produce their own, new text. Generative AI is a distinct sub-branch of ML because its purpose is generation, while other kinds of models might be aimed at tasks like classification and prediction.

Is Generative AI The Same Thing As Large Language Models?

At this point, you might be wondering how whether generative AI is the same thing as LLMs. With all the hype and movement in the space, it’s easy to lose track of the terminology.

LLMs are a subset of the broader category of generative AI. All LLMs are generative AI, but there are generative algorithms that work with images, music, chess moves, and other things besides natural language.

How Did The Researchers Study the Effects of Generative AI on Work?

Now we understand that ML learns to recognize patterns, how this is different from classical computer programming, and how generative AI fits into the whole picture.

We can now get to the meat of the study, beginning with how Brynjolfsson, Li, and Raymond actually studied the use of generative AI by workers at a contact center.

The firm from which they drew their data is a Fortune 500 company that creates enterprise software. Its support agents are located mainly in the Phillippines (with a smaller number in the U.S.) to resolve customer issues via a chat interface.

Most of the agent’s job boils down to answering questions from the owners of small businesses that use the firm’s software. Their productivity is assessed via how long it takes them to resolve a given issue (“average handle time”), the fraction of total issues a given agent is able to resolve to the customer’s satisfaction (“resolution rate”), and the net number of customers who would recommend the agent (“net promoter score.”)

Line graphs showing handle time, resolution rate and customer satisfaction using AI.

The AI used by the firm is a version of GPT which has received additional training on conversations between customers and agents. It is mostly used for two things: generating appropriate responses to customers in real-time and surfacing links to the firm’s technical documentation to help answer specific questions about the software.

Bear in mind that this generative AI system is meant to help the agents in performing their jobs. It is not intended to – and is not being trained to – completely replace them. They maintain autonomy in deciding whether and how much of the AI’s suggestions to take.

How Did Generative AI Change Work?

Next, we’ll look at what the study actually uncovered.

There were four main findings, touching on how total worker productivity was impacted, whether productivity gains accrued mainly to low-skill or high-skill workers, how access to an AI tool changed learning on the job, and how the organization changed as a result.

1. Access to Generative AI Boosted Worker Productivity

First, being able to use the firm’s AI tool increased worker productivity by almost 14%. This came from three sources: a reduction in how long it took any given agent to resolve a particular issue, an expansion in the total number of resolutions an agent was able to work on in an hour, and a small jump in the fraction of chats that were completed successfully.

The firm's AI tool increased worker productivity by almost 14%

This boost happened very quickly, showing up in the first month after deployment, growing a little in the second month, and then remaining at roughly that level for the duration of the study.

2. Access to Generative AI Was Most Helpful for Lower-Skilled Agents

Intriguingly, the greatest productivity gains were seen among agents that were relatively low-skill, such as those that were new to the job, with longer-serving, higher-skilled agents seeing virtually none.

The agents in the very bottom quintile for skill level, in fact, were able to resolve 35% more calls per hour—a substantial jump.

The agents in the very bottom quintile for skill level were able to resolve more calls per hour 35%.

With the benefit of hindsight it’s tempting to see these results as obvious, but they’re not. Earlier studies have usually found that the benefits of new computing technologies accrued to the ablest workers, or led firms to raise the bar on skill requirements for different positions.

If it’s true that generative AI is primarily going to benefit less able employees, this fact alone will distinguish it from prior waves of innovation. [1]

3. Access To Generative AI Helps New Workers “Move Down the Learning Curve”

Perhaps the most philosophically interesting conclusion drawn by the study’s authors relates to how generative AI is able to partially learn the tacit knowledge of more skilled workers.

The term “tacit knowledge” refers to the hard-to-articulate behaviors you pick up as you get good at something.

Imagine trying to teach a person how to ride a bike. It’s easy enough to give broad instructions (“check your shoelaces”, “don’t brake too hard”), but there ends up being a billion little subtleties related to foot placement, posture, etc. that are difficult to get into words.

This is true for everything, and it’s part of what distinguishes masters from novices. It’s also a major reason for the fact that many professions have been resistant to full automation.

Remember our discussion of how rule-based programming is poorly suited to tasks where the rules are hard to state? Well, that applies to tasks involving a lot of tacit knowledge. If no one, not even an expert, can tell you precisely what steps to take to replicate their results, then no one is going to be able to program a computer to do it either.

But ML and generative AI don’t face this restriction. With data sets that are big enough and rich enough, the algorithms might be able to capture some of the tacit knowledge expert contact center agents have, e.g. how they phrase replies to customers.

This is suggested by the study’s results. By analyzing the text of customer-agent interactions, the authors found that novice agents using generative AI were able to sound more like experienced agents, which contributed to their success.

4. Access to Generative AI Changed the Way the Organization Functioned

Organizations are profoundly shaped by their workers, and we should expect to see organization-level changes when a new technology dramatically changes how employees operate.

Two major findings from the study were that employee turnover was markedly reduced and there were far fewer customers “escalating” an issue by asking to speak to a supervisor. This could be because agents using generative AI were overall treated much better by customers (who have been known to become frustrated and irate), leading to less stress.

The Contact Center of the Future

Generative AI has already impacted many domains, and this trend will likely only continue going forward. “Generative AI At Work” provides a fascinating glimpse into the way that this technology changed a large contact center by boosting productivity among the least-skilled agents, helping disseminate the hard-won experience of the most-skilled agents, and overall reducing turnover and dissatisfaction.

If this piece has piqued your curiosity about how you can use advanced AI tools for customer-facing applications, schedule a demo of the Quiq conversational CX platform today.

From resolving customer complaints with chatbots to automated text-message follow-ups, we’ve worked hard to build a best-in-class solution for businesses that want to scale with AI.

Let’s see what we can do for you!

[1] See e.g. this quote: “Our paper is related to a large literature on the impact of various forms of technological adoption on worker productivity and the organization of work (e.g. Rosen, 1981; Autor et al., 1998; Athey and Stern, 2002; Bresnahan et al., 2002; Bartel et al., 2007; Acemoglu et al., 2007; Hoffman et al., 2017; Bloom et al., 2014; Michaels et al., 2014; Garicano and Rossi-Hansberg, 2015; Acemoglu and Restrepo, 2020). Many of these studies, particularly those focused on information technologies, find evidence that IT complements higher-skill workers (Akerman et al., 2015; Taniguchi and Yamada, 2022). Bartel et al. (2007) shows that firms that adopt IT tend to use more skilled labor and increase skill requirements for their workers. Acemoglu and Restrepo (2020) study the diffusion of robots and find that the negative employment effects of robots are most pronounced for workers in blue-collar occupations and those with less than a college education. In contrast, we study a different type of technology—generative AI—and find evidence that it most effectively augments lower-skill workers.”

A Guide to Fine-Tuning Pretrained Language Models for Specific Use Cases

Over the past half-year, large language models (LLMs) like ChatGPT have proven remarkably useful for a wide range of tasks, including machine translation, code analysis, and customer interactions in places like contact centers.

For all this power and flexibility, however, it is often still necessary to use fine-tuning to get an LLM to generate high-quality output for specific use cases.

Today, we’re going to do a deep dive into this process, understanding how these models work, what fine-tuning is, and how you can leverage it for your business.

What is a Pretrained Language Model?

First, let’s establish some background context by tackling the question of what pretrained models are and how they work.

The “GPT” in ChatGPT stands for “generative pretrained transformer”, and this gives us a clue as to what’s going on under the hood. ChatGPT is a generative model, meaning its purpose is to create new output; it’s pretrained, meaning that it has already seen a vast amount of text data by the time end users like us get our hands on it; and it’s a transformer, which refers to the fact that it’s built out of billions of transformer modules stacked into layers.

If you’re not conversant in the history of machine learning it can be difficult to see what the big deal is, but pretrained models are a relatively new development. Once upon a time in the ancient past (i.e. 15 or 20 years ago), it was an open question as to whether engineers would be able to pretrain a single model on a dataset and then fine-tune its performance, or whether they would need to approach each new problem by training a model from scratch.

This question was largely resolved around 2013, when image models trained on the ImageNet dataset began sweeping competitions left and right. Since then it has become more common to use pretrained models as a starting point, but we want to emphasize that this approach does not always work. There remain a vast number of important projects for which building a bespoke model is the only way to go.

What is Transfer Learning?

Transfer learning refers to when an agent or system figures out how to solve one kind of problem and then uses this knowledge to solve a different kind of problem. It’s a term that shows up all over artificial intelligence, cognitive psychology, and education theory.

Author, chess master, and martial artist Josh Waitzkin captures the idea nicely in the following passage from his blockbuster book, The Art of Learning:

“Since childhood I had treasured the sublime study of chess, the swim through ever-deepening layers of complexity. I could spend hours at a chessboard and stand up from the experience on fire with insight about chess, basketball, the ocean, psychology, love, art.”

Transfer learning is a broader concept than pretraining, but the two ideas are closely related. In machine learning, competence can be transferred from one domain (generating text) to another (translating between natural languages or creating Python code) by pretraining a sufficiently large model.

What is Fine-Tuning A Pretrained Language Model?

Fine-tuning a pretrained language model occurs when the model is repurposed for a particular task by being shown illustrations of the correct behavior.

If you’re in a whimsical mood, for example, you might give ChatGPT a few dozen limericks so that its future output always has that form.

It’s easy to confuse fine-tuning with a few other techniques for getting optimum performance out of LLMs, so it’s worth getting clear on terminology before we attempt to give a precise definition of fine-tuning.

Fine-Tuning a Language Model v.s. Zero-Shot Learning

Zero-shot learning is whatever you get out of a language model when you feed it a prompt without making any special effort to show it what you want. It’s not technically a form of fine-tuning at all, but it comes up in a lot of these conversations so it needs to be mentioned.

(NOTE: It is sometimes claimed that prompt engineering counts as zero-shot learning, and we’ll have more to say about that shortly.)

Fine-Tuning a Language Model v.s. One-Shot Learning

One-shot learning is showing a language model a single example of what you want it to do. Continuing our limerick example, one-shot learning would be giving the model one limerick and instructing it to format its replies with the same structure.

Fine-Tuning a Language Model v.s. Few-Shot Learning

Few-shot learning is more or less the same thing as one-shot learning, but you give the model several examples of how you want it to act.

How many counts as “several”? There’s no agreed-upon number that we know about, but probably 3 to 5, or perhaps as many as 10. More than this and you’re arguably not doing “few”-shot learning anymore.

Fine-Tuning a Language Model v.s. Prompt Engineering

Large language models like ChatGPT are stochastic and incredibly sensitive to the phrasing of the prompts they’re given. For this reason, it can take a while to develop a sense of how to feed the model instructions such that you get what you’re looking for.

The emerging discipline of prompt engineering is focused on cultivating this intuitive feel. Minor tweaks in word choice, sentence structure, etc. can have an enormous impact on the final output, and prompt engineers are those who have spent the time to learn how to make the most effective prompts (or are willing to just keep tinkering until the output is correct).

Does prompt engineering count as fine-tuning? We would argue that it doesn’t, primarily because we want to reserve the term “fine-tuning” for the more extensive process we describe in the next few sections.

Still, none of this is set in stone, and others might take the opposite view.

Distinguishing Fine-Tuning From Other Approaches

Having discussed prompt engineering and zero-, one-, and few-shot learning, we can give a fuller definition of fine-tuning.

Fine-tuning is taking a pretrained language model and optimizing it for a particular use case by giving it many examples to learn from. How many you ultimately need will depend a lot on your task – particularly how different the task is from the model’s training data and how strict your requirements for its output are – but you should expect it to take on the order of a few dozen or a few hundred examples.

Though it bears an obvious similarity to one-shot and few-shot learning, fine-tuning will generally require more work to come up with enough examples, and you might have to build a rudimentary pipeline that feeds the examples in through the API. It’s almost certainly not something you’ll be doing directly in the ChatGPT web interface.

Contact Us

How Can I Fine-Tune a Pretrained Language Model?

Having gotten this far, we can now turn our attention to what the fine-tuning procedure actually consists in. The basic steps are: deciding what you’re wanting to accomplish, gather the requisite data (and formatting it correctly), feeding it to your model, and evaluating the results.

Let’s discuss each, in turn.

Deciding on Your Use Case

The obvious place to begin is figuring out exactly what it is you want to fine-tune a pretrained model to do.

It may seem as though this is too obvious to be included as its own standalone step, but we’ve singled it out is because you need to think through the specifics of what you’re trying to accomplish. It’s not enough to say “We want to fine-tune this model to write tweets for us”, you have to consider questions like “Should the tone by formal or informal” and “Are we okay with it adding emojis”.

This matters because it informs the data that you gather for fine-tuning. If you feed the model irreverent tweets with lots of 🙂s and 🚀s, you can expect that to show up in the tweets it generates for you.

Picking a Base Model

Next, you have to figure out which model you want to fine-tune. This is not the place to get into a discussion on all the “foundation” models that are available today, but suffice it to say that HuggingFace, OpenAI, Anthropic, and the other companies in the pretrained model space offer a variety of different options. BigScience has BLOOM and T0pp models, OpenAI has Ada, Babbage, Da Vinci, and Curie, and so on.

Different models have different fine-tuning requirements, so this choice will impact how you structure your pipeline later.

Gathering and Formatting the Fine-Tuning Data

The heart of fine-tuning a pretrained model is gathering the samples from which it will learn. This is more or less exactly what it sounds like: you compile a list of prompts and desired responses. This is the data that will adjust the model’s internal parameters, optimizing it for your use case.

AI educator Dave Shapiro has a tutorial on fine-tuning GPT-3 to be a tutor for younger students, and here is one of the training samples he used (“TIM” is the name he’s using for the tutor):

User: Hey TIM, I’m interested in learning more about Charles Darwin. Can you tell me a bit about him?

TIM: Charles Darwin was an English naturalist and geologist who developed the theory of evolution by natural selection. He was born into a wealthy family and was able to attend the prestigious Shrewsbury School. After graduation, he enrolled at the University of Edinburgh to study medicine, but he soon dropped out and enrolled at Christ’s College, Cambridge.

User: Interesting! I didn’t know he was born into a wealthy family.

TIM: Yes, his family was very wealthy and his father was a doctor. Darwin’s grandfather, Erasmus Darwin, was a well-known scientist and wrote several books on botany and evolution.

User: Wow, I had no idea. What made Darwin decide to drop out of medical school and study at Cambridge?

TIM: Darwin was interested in natural history and had developed a keen interest in taxidermy. After dropping out of medical school, he traveled to Wales with some friends to study geology. It was during this time that he began to develop his theory of evolution.

One thing you’ll note is that this sample is pretty long. It could take a while to come up with hundreds of these to fine-tune your model, but that’s actually something a generative pretrained language model like ChatGPT can help with. Just be sure you have a human go over the samples and check them for accuracy, or you risk compromising the quality of your outputs.

Another thing to think about is how you’ll handle adversarial behavior and edge cases. If you’re training a conversational AI chatbot for a contact center, for example, you’ll want to include plenty of instances of the model calmly and politely responding to an irate customer. That way, your output will be similarly calm and polite.

Lastly, you’ll have to format the fine-tuning data according to whatever specifications are required by the base model you’re using. It’ll probably be something similar to JSON, but check the documentation to be sure.

Feeding it to Your Model

Now that you’ve got your samples ready, you’ll have to give them to the model for fine-tuning. This will involve you feeding the examples to the model via its API and waiting until the process has finished.

What is the Difference Between Fine-Tuning and a Pretrained Model?

A pretrained model is one that has been previously trained on a particular dataset or task, and fine-tuning is getting that model to do well on a new task by showing it examples of the output you want to see.

Pretrained models like ChatGPT are often pretty good out of the box, but if you’re wanting it to create legal contracts or work with highly-specialized scientific vocabulary, you’ll likely need to fine-tune it.

Should You Fine-Tune a Pretrained Model For Your Business?

Generative pretrained language models like ChatGPT and Bard have already begun to change the way businesses like contact centers function, and we think this is a trend that is likely to accelerate in the years ahead.

If you’ve been intrigued by the possibility of fine-tuning a pretrained model to supercharge your enterprise, then hopefully the information contained in this article gives you some ideas on how to begin.

Another option is to leverage the power of the Quiq platform. We’ve built a best-in-class conversational AI system that can automate substantial parts of your customer interactions (without you needing to run your own models or set up a fine-tuning pipeline.)

To see how we can help, schedule a demo with us today!

Request A Demo

Brand Voice And Tone Building With Prompt Engineering

Artificial intelligence tools like ChatGPT are changing the way strategists are building their brands.

But with the staggering rate of change in the field, it can be hard to know how to utilize its full potential. Should you hire an engineering team? Pay for a subscription and do it yourself?

The truth is, it depends. But one thing you can try is prompt engineering, a term that refers to carefully crafting the instructions you give to the AI to get the best possible results.

In this piece, we’ll cover the basics of prompt engineering and discuss the many ways in which you can build your brand voice with generative AI.

What is Prompt Engineering?

As the name implies, generative AI refers to any machine learning (ML) model whose primary purpose is to generate some output. There are generative AI applications for creating new images, text, code, and music.

There are also ongoing efforts to expand the range of outputs generative models can handle, such as a fascinating project to build a high-level programming language for creating new protein structures.

The way you get output from a generative AI model is by prompting it. Just as you could prompt a friend by asking “How was your vacation in Japan,” you can prompt a generative model by asking it questions and giving it instructions. Here’s an example:

“I’m working on learning Java, and I want you to act as though you’re an experienced Java teacher. I keep seeing terms like `public class` and `public static void`. Can you explain to me the different types of Java classes, giving an example and explanation of each?”

When we tried this prompt with GPT-4, it responded with a lucid breakdown of different Java classes (i.e., static, inner, abstract, final, etc.), complete with code snippets for each one.

When Small Changes Aren’t So Small

Mapping the relationship between human-generated inputs and machine-generated outputs is what the emerging field of “prompt engineering” is all about.

Prompt engineering only entered popular awareness in the past few years, as a direct consequence of the meteoric rise of large language models (LLMs). It rapidly became obvious that GPT-3.5 was vastly better than pretty much anything that had come before, and there arose a concomitant interest in the best ways of crafting prompts to maximize the effectiveness of these (and similar) tools.

At first glance, it may not be obvious why prompt engineering is a standalone profession. After all, how difficult could it be to simply ask the computer to teach you Chinese or explain a coding concept? Why have a “prompt engineer” instead of a regular engineer who sometimes uses GPT-4 for a particular task?

A lot could be said in reply, but the big complication is the fact that a generative AI’s output is extremely dependent upon the input it receives.

An example pulled from common experience will make this clearer. You’ve no doubt noticed that when you ask people different kinds of questions you elicit different kinds of responses. “What’s up?” won’t get the same reply as “I notice you’ve been distant recently, does that have anything to do with losing your job last month?”

The same basic dynamic applies to LLMs. Just as subtleties in word choice and tone will impact the kind of interaction you have with a person, they’ll impact the kind of interaction you have with a generative model.

All this nuance means that conversing with your fellow human beings is a skill that takes a while to develop, and that also holds in trying to productively using LLMs. You must learn to phrase your queries in a way that gives the model good context, includes specific criteria as to what you’re looking for in a reply, etc.

Honestly, it can feel a little like teaching a bright, eager intern who has almost no initial understanding of the problem you’re trying to get them to solve. If you give them clear instructions with a few examples they’ll probably do alright, but you can’t just point them at a task and set them loose.

We’ll have much more to say about crafting the kinds of prompts that help you build your brand voice in upcoming sections, but first, let’s spend some time breaking down the anatomy of a prompt.

This context will come in handy later.

What’s In A Prompt?

In truth, there are very few real restrictions on how you use an LLM. If you ask it to do something immoral or illegal it’ll probably respond along the lines of “I’m sorry Dave, but as a large language model I can’t let you do that,” otherwise you can just start feeding it text and seeing how it responds.

That having been said, prompt engineers have identified some basic constituent parts that go into useful prompts. They’re worth understanding as you go about using prompt engineering to build your brand voice.

Context

First, it helps to offer the LLM some context for the task you want done. Under most circumstances, it’s enough to give it a sentence or two, though there can be instances in which it makes sense to give it a whole paragraph.

Here’s an example prompt without good context:

“Can you write me a title for a blog post?”

Most human beings wouldn’t be able to do a whole lot with this, and neither can an LLM. Here’s an example prompt with better context:

“I’ve just finished a blog post for a client that makes legal software. It’s about how they have the best payments integrations, and the tone is punchy, irreverent, and fun. Could you write me a title for the post that has the same tone?”

To get exactly what you’re looking for you may need to tinker a bit with this prompt, but you’ll have much better chances with the additional context.

Instructions

Of course, the heart of the matter is the actual instructions you give the LLM. Here’s the context-added prompt from the previous section, whose instructions are just okay:

“I’ve just finished a blog post for a client that makes legal software. It’s about how they have the best payments integrations, and the tone is punchy, irreverent, and fun. Could you write me a title for the post that has the same tone?”

A better way to format the instructions is to ask for several alternatives to choose from:

“I’ve just finished a blog post for a client that makes legal software. It’s about how they have the best payments integrations, and the tone is punchy, irreverent, and fun. Could you give me 2-3 titles for the blog post that have the same tone?”

Here again, it’ll often pay to go through a couple of iterations. You might find – as we did when we tested this prompt – that GPT-4 is just a little too irreverent (it used profanity in one of its titles.) If you feel like this doesn’t strike the right tone for your brand identity you can fix it by asking the LLM to be a bit more serious, or rework the titles to remove the profanity, etc.

You may have noticed that “keep iterating and testing” is a common theme here.

Example Data

Though you won’t always need to get the LLM input data, it is sometimes required (as when you need it to summarize or critique an argument) and is often helpful (as when you give it a few examples of titles you like.)

Here’s the reworked prompt from above, with input data:

“I’ve just finished a blog post for a client that makes legal software. It’s about how they have the best payments integrations, and the tone is punchy, irreverent, and fun. Could you give me 2-3 titles for the blog post that have the same tone?

Here’s a list of two titles that strike the right tone:
When software goes hard: dominating the legal payments game.
Put the ‘prudence’ back in ‘jurisprudence’ by streamlining your payment collections.”

Remember, LLMs are highly sensitive to what you give them as input, and they’ll key off your tone and style. Showing them what you want dramatically boosts the chances that you’ll be able to quickly get what you need.

Output Indicators

An output indicator is essentially any concrete metric you use to specify how you want the output to be structured. Our existing prompt already has one, and we’ve added another (both are bolded):

“I’ve just finished a blog post for a client that makes legal software. It’s about how they have the best payments integrations, and the tone is punchy, irreverent, and fun. Could you give me 2-3 titles for the blog post that have the same tone? Each title should be approximately 60 characters long.

Here’s a list of two titles that strike the right tone:
When software goes hard: dominating the legal payments game.
Put the ‘prudence’ back in ‘jurisprudence’ by streamlining your payment collections.”

As you go about playing with LLMs and perfecting the use of prompt engineering in building your brand voice, you’ll notice that the models don’t always follow these instructions. Sometimes you’ll ask for a five-sentence paragraph that actually contains eight sentences, or you’ll ask for 10 post ideas and get back 12.

We’re not aware of any general way of getting an LLM to consistently, strictly follow instructions. Still, if you include good instructions, clear output indicators, and examples, you’ll probably get close enough that only a little further tinkering is required.

What Are The Different Types of Prompts You Can Use For Prompt Engineering?

Though prompt engineering for tasks like brand voice and tone building is still in its infancy, there are nevertheless a few broad types of prompts that are worth knowing.

  • Zero-shot prompting: A zero-shot prompt is one in which you simply ask directly for what you want without providing any examples. It’ll simply generate an output on the basis of its internal weights and prior training, and, surprisingly, this is often more than sufficient.
  • One-shot prompting: With a one-shot prompt, you’re asking the LLM for output and giving it a single example to learn from.
  • Few-shot prompting: Few-shot prompts involve a least a few examples of expected output, as in the two titles we provided our prompt when we asked it for blog post titles.
  • Chain-of-thought prompting: Chain-of-thought prompting is similar to few-shot prompting, but with a twist. Rather than merely giving the model examples of what you want to see, you craft your examples such that they demonstrate a process of explicit reasoning. When done correctly, the model will actually walk through the process it uses to reason about a task. Not only does this make its output more interpretable, but it can also boost accuracy in domains at which LLMs are notoriously bad, like addition.

What Are The Challenges With Prompt Engineering For Brand Voice?

We don’t use the word “dazzling” lightly around here, but that’s the best way of describing the power of ChatGPT and the broader ecosystem of large language models.

You would be hard-pressed to find many people who have spent time with one and come away unmoved.

Still, challenges remain, especially when it comes to using prompt engineering for content marketing or building your brand voice.

One well-known problem is the tendency of LLMs to completely make things up, a phenomenon referred to as “hallucination”. The internet is now filled with examples of ChatGPT completely fabricating URLs, books, papers, professions, and individuals. If you use an LLM to create content for your website and don’t thoroughly vet it, you run the risk of damaging your reputation and your brand if it contains false or misleading information.

A related problem is legal or compliance issues that emerge as a result of using an LLM. Though the technology hasn’t been around long enough to get anyone into serious trouble (we suspect it won’t be long), there are now cases in which attorneys have been caught using faulty research generated by ChatGPT or engineering teams have leaked proprietary secrets by feeding meeting notes into it.

Finally, if you’re offering a fine-tuned model to customers to do something like answer questions, you must be very, very careful in delimiting its scope so that it doesn’t generate unwanted behavior. It’s pretty easy to accidentally wander into fraught territory when engaging with an LLM in an open-ended manner, and that’s not even counting users who deliberately try to get it to respond inappropriately.

One potential solution to this problem is by crafting your prompts such that they contain clear instructions about what not to do. You may tell it not to discuss its own rules, not to change its tone, not to speak negatively about anyone, not to argue, etc.

Crafting a prompt that illustrates the correct behavior while explicitly ruling out any incorrect behaviors is a non-trivial task, requiring a great deal of testing and refinement. But it’s one you’ll have to get right if you want to leverage LLMs for your brand voice while avoiding any possible harm down the line.

Prompt Engineering For Marketers

With all this context having been established, you’re no doubt excited about using prompt engineering in your marketing efforts. In this section, we’ll go through a couple of ways you can do that.

Coming up with ideas

One place where LLMs excel is in helping you with ideation. A lot goes into content marketing, but all of that follows from having a solid list of basic concepts to work from.

We tested ChatGPT’s functionality by feeding it this query:

“I’m starting a blog about becoming a data scientist. Could you give me a list of 10 blog post ideas, with a mix of beginner, intermediate, and advanced topics? The titles should be eye-catching but not too aggressive.

Here are two examples of good blog post titles:

‘Data science: what to know when considering a career in this hot new field.’
‘Data science v.s. data engineering: everything you wanted to know’”

And we got back some excellent ideas:

  1. “Demystifying Data Science: A Beginner’s Guide to Jumpstarting Your Career”
  2. “From Novice to Ninja: Essential Skills for Every Data Science Aspirant”
  3. “Unlocking the Power of Python: A Data Scientist’s Best Friend”
  4. “Exploratory Data Analysis: Unveiling Insights Hidden in the Numbers”
  5. “Machine Learning 101: A Gentle Introduction for Data Science Enthusiasts”
  6. “Feature Engineering: Enhancing Your Models Like a Pro”
  7. “The Art of Natural Language Processing: Extracting Meaning from Text”
  8. “Unsupervised Learning: Unveiling Patterns and Structure in Your Data”
  9. “Deep Dive into Neural Networks: Building Advanced Models for Predictive Analytics”
  10. “Data Science in the Real World: Tackling Complex Business Problems with Analytics”

And look, it even generated the correct number of titles! You’ll notice, though that every single one follows the same template, with an initial phrase, a colon, and then a follow-on phrase. That’s because the two examples I gave have this structure, and that’s what it used as a guide.

Performing Background Research

After you’re done coming up with good ideas for building your brand voice, you’ll likely want to do some additional background research on your domain. This, too, is a place where LLMs excel.

There can be a lot of subtlety to this. You might start with something obvious, like “give me a list of the top authors in the keto diet niche”, but you can also get more creative than this. We’ve heard of copywriters who have used GPT-3.5 to generate lengthy customer testimonials for fictional products, or diary entries for i.e. 40-year-old suburban dads who are into DIY home improvement projects.

Regardless, with a little bit of ingenuity, you can generate a tremendous amount of valuable research that can inform your attempts to develop a brand voice.

Be careful, though; this is one place where model hallucinations could be really problematic. Be sure to manually check a model’s outputs before using them for anything critical.

Generating Actual Content

Of course, one place where content marketers are using LLMs more often is in actually writing full-fledged content. We’re of the opinion that GPT-3.5 is still not at the level of a skilled human writer, but it’s excellent for creating outlines, generating email blasts, and writing relatively boilerplate introductions and conclusions.

Getting better at prompt engineering

Despite the word “engineering” in its title, prompt engineering remains as much an art as it is a science. Hopefully, the tips we’ve provided here will help you structure your prompts in a way that gets you good results, but there’s no substitute for practicing the way you interact with LLMs.

One way to approach this task is by paying careful attention to the ways in which small word choices impact the kinds of output generated. You could begin developing an intuitive feel for the relationship between input text and output text by simply starting multiple sessions with ChatGPT and trying out slight variations of prompts. If you really want to be scientific about it, copy everything over into a spreadsheet and look for patterns. Over time, you’ll become more and more precise in your instructions, just as an experienced teacher or manager does.

Prompt Engineering Can Help You Build Your Brand

Advanced AI models like ChatGPT are changing the way SEO, content marketing, and brand strategy are being done. From creating buyer personas to using chatbots for customer interactions, these tools can help you get far more work done with less effort.

But you have to be cautious, as LLMs are known to hallucinate information, change their tone, and otherwise behave inappropriately.

With the right prompt engineering expertise, these downsides can be ameliorated, and you’ll be on your way to building a strong brand. If you’re interested in other ways AI tools can take your business to the next level, schedule a demo of Quiq’s conversational CX platform today!

Contact Us

LLMs For the Enterprise: How to Protect Brand Safety While Building Your Brand Persona

It’s long been clear that advances in artificial intelligence change how businesses operate. Whether it’s extremely accurate machine translation, chatbots that automate customer service tasks, or spot-on recommendations for music and shows, enterprises have been using advanced AI systems to better serve their customers and boost their bottom line for years.

Today the big news is generative AI, with large language models (LLMs) in particular capturing the imagination. As we’d expect, businesses in many different industries are enthusiastically looking at incorporating these tools into their workflows, just as prior generations did for the internet, computers, and fax machines.

But this alacrity must be balanced with a clear understanding of the tradeoffs involved. It’s one thing to have a language model answer simple questions, and quite another to have one engaging in open-ended interactions with customers involving little direct human oversight.

If you have an LLM-powered application and it goes off the rails, it could be mildly funny, or it could do serious damage to your brand persona. You need to think through both possibilities before proceeding.

This piece is intended as a primer on effectively using LLMs for the enterprise. If you’re considering integrating LLMs for specific applications and aren’t sure how to weigh the pros and cons, it will provide invaluable advice on the different options available while furnishing the context you need to decide which is the best fit for you.

How Are LLMs Being Used in Business?

LLMs like GPT-4 are truly remarkable artifacts. They’re essentially gigantic neural networks with billions of internal parameters, trained on vast amounts of text data from books and the internet.

Once they’re ready to go, they can be used to ask and answer questions, suggest experiments or research ideas, write code, write blog posts, and perform many other tasks.

Their flexibility, in fact, has come as quite a surprise, which is why they’re showing up in so many places. Before we talk about specific strategies for integrating LLMs into your enterprise, let’s walk through a few business use cases for the technology.

Generating (or rewriting) text

The obvious use case is generating text. GPT-4 and related technologies are very good at writing generic blog posts, copy, and emails. But they’ve also proven useful in more subtle tasks, like producing technical documentation or explaining how pieces of code work.

Sometimes it makes sense to pass this entire job on to LLMs, but in other cases, they can act more like research assistants, generating ideas or taking human-generated bullet points and expanding on them. It really depends on the specifics of what you’re trying to accomplish.

Conversational AI

A subcategory of text generation is using an LLM as a conversational AI agent. Clients or other interested parties may have questions about your product, for instance, and many of them can be answered by a properly fine-tuned LLM instead of by a human. This is a use case where you need to think carefully about protecting your brand persona because LLMs are flexible enough to generate inappropriate responses to questions. You should extensively test any models meant to interact with customers and be sure your tests include belligerent or aggressive language to verify that the model continues to be polite.

Summarizing content

Another place that LLMs have excelled is in summarizing already-existing text. This, too, is something that once would’ve been handled by a human, but can now be scaled up to the greater speed and flexibility of LLMs. People are using LLMs to summarize everything from basic articles on the internet to dense scientific and legal documents (though it’s worth being careful here, as they’re known to sometimes include inaccurate information in these summaries.)

Answering questions

Though it might still be a while before ChatGPT is able to replace Google, it has become more common to simply ask it for help rather than search for the answer online. Programmers, for example, can copy and paste the error messages produced by their malfunctioning code into ChatGPT to get its advice on how to proceed. The same considerations around protecting brand safety that we mentioned in the ‘conversational AI’ section above apply here as well.

Classification

One way to get a handle on a huge amount of data is to use a classification algorithm to sort it into categories. Once you know a data point belongs in a particular bucket you already know a fair bit about it, which can cut down on the amount of time you need to spend on analysis. Classifying documents, tweets, etc. is something LLMs can help with, though at this point a fair bit of technical work is required to get models like GPT-3 to reliably and accurately handle classification tasks.

Sentiment analysis

Sentiment analysis refers to a kind of machine learning in which the overall tone of a piece of text is identified (i.e. is it happy, sarcastic, excited, etc.) It’s not exactly the same thing as classification, but it’s related. Sentiment analysis shows up in many customer-facing applications because you need to know how people are responding to your new brand persona or how they like an update to your core offering, and this is something LLMs have proven useful for.

What Are the Advantages of Using LLMs in Business?

More and more businesses are investigating LLMs for their specific applications because they confer many advantages to those that know how to use them.

For one thing, LLMs are extremely well-suited for certain domains. Though they’re still prone to hallucinations and other problems, LLMs can generate high-quality blog posts, emails, and general copy. At present, the output is usually still not as good as what a skilled human can produce.

But LLMs can generate text so quickly that it often makes sense to have the first draft created by a model and tweaked by a human, or to have relatively low-effort tasks (like generating headlines for social media) delegated to a machine so a human writer can focus on more valuable endeavors.

For another, LLMs are highly flexible. It’s relatively straightforward to take a baseline LLM like GPT-4 and feed it examples of behavior you want to see, such as generating math proofs in the form of poetry (if you’re into that sort of thing.) This can be done with prompt engineering or with a more sophisticated pipeline involving the model’s API, but in either case, you have the option of effectively pointing these general-purpose tools at specific tasks.

None of this is to suggest that LLMs are always and everywhere the right tool for the job. Still, in many domains, it makes sense to examine using LLMs for the enterprise.

What Are the Disadvantages of Using LLMs in Business?

For all their power, flexibility, and jaw-dropping speed, there are nevertheless drawbacks to using LLMs.

One disadvantage of using LLMs in business that people are already familiar with is the variable quality of their output. Sometimes, the text generated by an LLM is almost breathtakingly good. But LLMs can also be biased and inaccurate, and their hallucinations – which may not be a big deal for SEO blog posts – will be a huge liability if they end up damaging your brand.

Exacerbating this problem is the fact that no matter how right or wrong GPT-4 is, it’ll format its response in flawless, confident prose. You might expect a human being who doesn’t understand medicine very well to misspell a specialized word like “Umeclidinium bromide”, and that would offer you a clue that there might be other inaccuracies. But that essentially never happens with an LLM, so special diligence must be exercised in fact-checking their claims.

There can also be substantial operational costs associated with training and using LLMs. If you put together a team to build your own internal LLM you should expect to spend (at least) hundreds of thousands of dollars getting it up and running, to say nothing of the ongoing costs of maintenance.

Of course, you could also build your applications around API calls to external parties like OpenAI, who offer their models’ inferences as an endpoint. This is vastly cheaper, but it comes with downsides of its own. Using this approach means being beholden to another entity, which may release updates that dramatically change the performance of their models and materially impact your business.

Perhaps the biggest underlying disadvantage to using LLMs, however, is their sheer inscrutability. True, it’s not that hard to understand at a high level how models like GPT-4 are trained. But the fact remains that no one really understands what’s happening inside of them. It’s usually not clear why tiny changes to a prompt can result in such wildly different outputs, for example, or why a prompt will work well for a while before performance suddenly starts to decline.

Perhaps you just got unlucky – these models are stochastic, after all – or perhaps OpenAI changed the base model. You might not be able to tell, and either way, it’s hard to build robust, long-range applications around technologies that are difficult to understand and predict.

Contact Us

How Can LLMs Be Integrated Into Enterprise Applications?

If you’ve decided you want to integrate these groundbreaking technologies into your own platforms, there are two basic ways you can proceed. Either you can use a 3rd-party service through an API, or you can try to run your own models instead.

In the following two sections, we’ll cover each of these options and their respective tradeoffs.

Using an LLM through an API

An obvious way of leveraging the power of LLMs is by simply including API calls to a platform that specializes in them, such as OpenAI. Generally, this will involve creating infrastructure that is able to pass a prompt to an LLM and return its output.

If you’re building a user-facing chatbot through this method, that would mean that whenever the user types a question, their question is sent to the model and its response is sent back to the user.

The advantages of this approach are that they offer an extremely low barrier to entry, low costs, and fast response times. Hitting an API is pretty trivial as engineering tasks go, and though you’re charged per token, the bill will surely be less than it would be to stand up an entire machine-learning team to build your own model.

But, of course, the danger is that you’re relying on someone else to deliver crucial functionality. If OpenAI changes its terms of service or simply goes bankrupt, you could find yourself in a very bad spot.

Another disadvantage is that the company running the model may have access to the data you’re passing to its models. A team at Samsung recently made headlines when it was discovered they’d been plowing sensitive meeting notes and proprietary source code directly into ChatGPT, where both were viewable by OpenAI. You should always be careful about the data you’re exposing, particularly if it’s customer data whose privacy you’ve been entrusted to protect.

Running Your Own Model

The way to ameliorate the problems of accessing an LLM through an API is to either roll your own or run an open-source model in an environment that you control.

Building the kind of model that can compete with GPT-4 is really, really difficult, and it simply won’t be an option for any but the most elite engineering teams.

Using an open-source LLM, however, is a much more viable option. There are now many such models for text or code generation, and they can be fine-tuned for the specifics of your use case.

By and large, open-source models tend to be smaller and less performant than their closed-source cousins, so you’ll have to decide whether they’re good enough for you. And you should absolutely not underestimate the complexity of maintaining an open-sourced LLM. Though it’s nowhere near as hard as training one from scratch, maintaining an advanced piece of AI software is far from a trivial task.

All that having been said, this is one path you can take if you have the right applications in mind and the technical skills to pull it off.

How to Protect Brand Safety While Building Your Brand Persona

Throughout this piece, we’ve made mention of various ways in which LLMs can help supercharge your business while also warning of the potential damage a bad LLM response can do to your brand.

At present, there is no general-purpose way of making sure an LLM only does good things while never doing bad things. They can be startlingly creative, and with that power comes the possibility that they’ll be creative in ways you’d rather them not be (same as children, we suppose.)

Still, it is possible to put together an extensive testing suite that substantially reduces the possibility of a damaging incident. You need to feed the model many different kinds of interactions, including ones that are angry, annoyed, sarcastic, poorly spelled or formatted, etc., to see how it behaves.

What’s more, this testing needs to be ongoing. It’s not enough to run a test suite one weekend and declare the model fit for use, it needs to be periodically re-tested to ensure no bad behavior has emerged.

With these techniques, you should be able to build a persona as a company on the cutting edge while protecting yourself from incidents that damage your brand.

What Is the Future of LLMs and AI?

The business world moves fast, and if you’re not keeping up with the latest advances you run the risk of being left behind. At present, large language models like GPT-4 are setting the world ablaze with discussions of their potential to completely transform fields like customer experience chatbots.

If you want in on the action and you have the in-house engineering expertise, you could try to create your own offering. But if you would rather leverage the power of LLMs for chat-based applications by working with a world-class team that’s already done the hard engineering work, reach out to Quiq to schedule a demo.

Request A Demo

Semi-Supervised Learning Explained (With Examples)

From movie recommendations to chatbots as customer service reps, it seems like machine learning (ML) is absolutely everywhere. But one thing you may not realize is just how much data is required to train these advanced systems, and how much time and energy goes into formatting that data appropriately.

Machine learning engineers have developed many ways of trying to cut down on this bottleneck, and one of the techniques that have emerged from these efforts is semi-supervised learning.

Today, we’re going to discuss semi-supervised learning, how it works, and where it’s being applied.

What is Semi-Supervised Learning?

Semi-supervised learning (SSL) is an approach to machine learning (ML) that is appropriate for tasks where you have a large amount of data that you want to learn from, only a fraction of which is labeled.

Semi-supervised learning sits somewhere between supervised and unsupervised learning, and we’ll start by understanding these techniques because that will make it easier to grasp how semi-supervised learning works.

Supervised learning refers to any ML setup in which a model learns from labeled data. It’s called “supervised” because the model is effectively being trained by showing it many examples of the right answer.

Suppose you’re trying to build a neural network that can take a picture of different plant species and classify them. If you give it a picture of a rose it’ll output the “rose” label, if you give it a fern it’ll output the “fern” label, and so on.

The way to start training such a network is to assemble many labeled images of each kind of plant you’re interested in. You’ll need dozens or hundreds of such images, and they’ll each need to be labeled by a human.

Then, you’ll assemble these into a dataset and train your model on it. What the neural network will do is learn some kind of function that maps features in the image (the concentrations of different colors, say, or the shape of the stems and leaves) to a label (“rose”, “fern”.)

One drawback to this approach is that it can be slow and extremely expensive, both in funds and in time. You could probably put together a labeled dataset of a few hundred plant images in a weekend, but what if you’re training something more complex, where the stakes are higher? A model trained to spot breast cancer from a scan will need thousands of images, perhaps tens of thousands. And not just anyone can identify a cancerous lump, you’ll need a skilled human to look at the scan to label it “cancerous” and “non-cancerous.”

Unsupervised learning, by contrast, requires no such labeled data. Instead, an unsupervised machine learning algorithm is able to ingest data, analyze its underlying structure, and categorize data points according to this learned structure.

Semi-supervised learning

Okay, so what does this mean? A fairly common unsupervised learning task is clustering a corpus of documents thematically, and let’s say you want to do this with a bunch of different national anthems (hey, we’re not going to judge you for how you like to spend your afternoons!).

A good, basic algorithm for a task like this is the k-means algorithm, so-called because it will sort documents into k categories. K-means begins by randomly initializing k “centroids” (which you can think of as essentially being the center value for a given category), then moving these centroids around in an attempt to reduce the distance between the centroids and the values in the clusters.

This process will often involve a lot of fiddling. Since you don’t actually know the optimal number of clusters (remember that this is an unsupervised task), you might have to try several different values of k before you get results that are sensible.

To sort our national anthems into clusters you’ll have to first pre-process the text in various ways, then you’ll run it through the k-means clustering algorithm. Once that is done, you can start examining the clusters for themes. You might find that one cluster features words like “beauty”, “heart” and “mother”, another features words like “free” and “fight”, another features words like “guard” and “honor”, etc.

As with supervised learning, unsupervised learning has drawbacks. With a clustering task like the one just described, it might take a lot of work and multiple false starts to find a value of k that gives good results. And it’s not always obvious what the clusters actually mean. Sometimes there will be clear features that distinguish one cluster from another, but other times they won’t correspond to anything that’s easily interpretable from a human perspective.

Semi-supervised learning, by contrast, combines elements of both of these approaches. You start by training a model on the subset of your data that is labeled, then apply it to the larger unlabeled part of your data. In theory, this should simultaneously give you a powerful predictive model that is able to generalize to data it hasn’t seen before while saving you from the toil of creating thousands of your own labels.

How Does Semi-Supervised Learning Work?

We’ve covered a lot of ground, so let’s review. Two of the most common forms of machine learning are supervised learning and unsupervised learning. The former tends to require a lot of labeled data to produce a useful model, while the latter can soak up a lot of hours in tinkering and yield clusters that are hard to understand. By training a model on a labeled subset of data and then applying it to the unlabeled data, you can save yourself tremendous amounts of effort.

But what’s actually happening under the hood?

Three main variants of semi-supervised learning are self-training, co-training, and graph-based label propagation, and we’ll discuss each of these in turn.

Self-training

Self-training is the simplest kind of semi-supervised learning, and it works like this.

A small subset of your data will have labels while the rest won’t have any, so you’ll begin by using supervised learning to train a model on the labeled data. With this model, you’ll go over the unlabeled data to generate pseudo-labels, so-called because they are machine-generated and not human-generated.

Now, you have a new dataset; a fraction of it has human-generated labels while the rest contains machine-generated pseudo-labels, but all the data points now have some kind of label and a model can be trained on them.

Co-training

Co-training has the same basic flavor as self-training, but it has more moving parts. With co-training you’re going to train two models on the labeled data, each on a different set of features (in the literature these are called “views”.)

If we’re still working on that plant classifier from before, one model might be trained on the number of leaves or petals, while another might be trained on their color.

At any rate, now you have a pair of models trained on different views of the labeled data. These models will then generate pseudo-labels for all the unlabeled datasets. When one of the models is very confident in its pseudo-label (i.e., when the probability it assigns to its prediction is very high), that pseudo-label will be used to update the prediction of the other model, and vice versa.

Let’s say both models come to an image of a rose. The first model thinks it’s a rose with 95% probability, while the other thinks it’s a tulip with a 68% probability. Since the first model seems really sure of itself, its label is used to change the label on the other model.

Think of it like studying a complex subject with a friend. Sometimes a given topic will make more sense to you, and you’ll have to explain it to your friend. Other times they’ll have a better handle on it, and you’ll have to learn from them.

In the end, you’ll both have made each other stronger, and you’ll get more done together than you would’ve done alone. Co-training attempts to utilize the same basic dynamic with ML models.

Graph-based semi-supervised learning

Another way to apply labels to unlabeled data is by utilizing a graph data structure. A graph is a set of nodes (in graph theory we call them “vertices”) which are linked together through “edges.” The cities on a map would be vertices, and the highways linking them would be edges.

If you put your labeled and unlabeled data on a graph, you can propagate the labels throughout by counting the number of pathways from a given unlabeled node to the labeled nodes.

Imagine that we’ve got our fern and rose images in a graph, together with a bunch of other unlabeled plant images. We can choose one of those unlabeled nodes and count up how many ways we can reach all the “rose” nodes and all the “fern” nodes. If there are more paths to a rose node than a fern node, we classify the unlabeled node as a “rose”, and vice versa. This gives us a powerful alternative means by which to algorithmically generate labels for unlabeled data.

Contact Us

Semi-Supervised Learning Examples

The amount of data in the world is increasing at a staggering rate, while the number of human-hours available for labeling it all is increasing at a much less impressive clip. This presents a problem because there’s no end to the places where we want to apply machine learning.

Semi-supervised learning presents a possible solution to this dilemma, and in the next few sections, we’ll describe semi-supervised learning examples in real life.

  • Identifying cases of fraud: In finance, semi-supervised learning can be used to train systems for identifying cases of fraud or extortion. Rather than hand-labeling thousands of individual instances, engineers can start with a few labeled examples and proceed with one of the semi-supervised learning approaches described above.
  • Classifying content on the web: The internet is a big place, and new websites are put up all the time. In order to serve useful search results it’s necessary to classify huge amounts of this web content, which can be done with semi-supervised learning.
  • Analyzing audio and images: This is perhaps the most popular use of semi-supervised learning. When audio files or image files are generated they’re often not labeled, which makes it difficult to use them for machine learning. Beginning with a small subset of human-labeled data, however, this problem can be overcome.

How Is Semi-Supervised Learning Different From…?

With all the different approaches to machine learning, it can be easy to confuse them. To make sure you fully understand semi-supervised learning, let’s take a moment to distinguish it from similar techniques.

Semi-Supervised Learning vs Self-Supervised Learning

With semi-supervised learning you’re training a model on a subset of labeled data and then using this model to process the unlabeled data. Self-supervised learning is different in that it’s showing an algorithm some fraction of the data (say the first 80 words in a paragraph) and then having it predict the remainder (the other 20 words in a paragraph.)

Self-supervised learning is how LLMs like GPT-4 are trained.

Semi-Supervised Learning vs Reinforcement Learning

One interesting subcategory of ML we haven’t discussed yet is reinforcement learning (RL). RL involves leveraging the mathematics of sequential decision theory (usually a Markov Decision Process) to train an agent to interact with its environment in a dynamic, open-ended way.

It bears little resemblance to semi-supervised learning, and the two should not be confused.

Semi-Supervised Learning vs Active Learning

Active learning is a type of semi-supervised learning. The big difference is that, with active learning, the algorithm will send its lowest-confidence pseudo-labels to a human for correction.

When Should You Use Semi-Supervised Learning?

Semi-supervised learning is a way of training ML models when you only have a small amount of labeled data. By training the model on just the labeled subset of data and using it in a clever way to label the rest, you can avoid the difficulty of having a human being label everything.

There are many situations in which semi-supervised learning can help you make use of more of your data. That’s why it has found widespread use in domains as diverse as document classification, fraud, and image identification.

So long as you’re considering ways of using advanced AI systems to take your business to the next level, check out our generative AI resource hub to go even deeper. This technology is changing everything, and if you don’t want to be left behind, set up a time to talk with us.

Request A Demo