
Reinforcement Learning from Human Feedback


ChatGPT and other large language models like it are already transforming education, healthcare, software engineering, and the work being done in contact centers.

We’ve written extensively about how self-supervised learning is used to train these models, but one thing we haven’t spent much time on is reinforcement learning from human feedback (RLHF).

Today, we’re rectifying that. We’re going to dive into what reinforcement learning from human feedback is, why it’s important, and how it works.

With that done, you’ll have received a thorough education in this world-changing technology.

What is Reinforcement Learning from Human Feedback?

As you no doubt surmised from its name, reinforcement learning from human feedback involves two components: reinforcement learning and human feedback. Though the technical specifics are (as usual) very involved, the basic idea is simple: you have models produce output, humans rate the output that they prefer (based on its friendliness, completeness, accuracy, etc.), and then the model is updated accordingly.

It’ll help if we begin by talking about what reinforcement learning is. This background will prove useful in understanding the unfolding of the broader process.

What is Reinforcement Learning?

There are four widespread approaches to getting intelligent behavior from machines: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.

With supervised learning, you feed a statistical algorithm a bunch of examples of correctly-labeled data in the hope that it will generalize to further examples it hasn’t seen before. Regression and supervised classification models are standard applications of supervised learning.

Unsupervised learning is a similar idea, but you forgo the labels. It’s used for certain kinds of clustering tasks, and for applications like dimensionality reduction.

Semi-supervised learning is a combination of these two approaches. Suppose you have a gigantic body of photographs, and you want to develop an automated system to tag them. If some of them are tagged then your system can use those tags to learn a pattern, which can then be applied to the rest of the untagged images.

Finally, there’s reinforcement learning (RL). Reinforcement learning is entirely different. With reinforcement learning, you’re usually setting up an environment (like a video game), and putting an agent in the environment with a reward structure that tells it which actions are good and which are bad. If the agent successfully flies a spaceship through a series of rings, for example, that might be worth +10 points each, completing an entire level might be worth +100, crashing might be worth -1,000, and so on.

The idea is that, over time, the reinforcement learning agent will learn to execute a strategy that maximizes its long-term reward. It’ll realize that rings are worth a few points and so it should fly through them, it’ll learn that it should try to complete a level because that’s a huge reward bonus, it’ll learn that crashing is bad, etc.
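To make the reward structure concrete, here is a minimal Python sketch that scores two imagined episodes using the point values from the spaceship example. The event names, the `REWARDS` table, and the `discounted_return` helper are illustrative assumptions, not part of any particular RL library; the discount factor `gamma` is the usual way of weighting later rewards less than earlier ones.

```python
# Hypothetical reward table using the point values from the spaceship example.
REWARDS = {"ring": 10, "level_complete": 100, "crash": -1000}


def discounted_return(events, gamma=0.99):
    """Sum the rewards for a sequence of events, scaling step t by gamma**t."""
    total = 0.0
    for step, event in enumerate(events):
        total += (gamma ** step) * REWARDS[event]
    return total


# Two imagined episodes: one finishes the level, the other crashes early.
careful = ["ring", "ring", "ring", "level_complete"]
reckless = ["ring", "ring", "crash"]

print(discounted_return(careful, gamma=1.0))   # 130.0
print(discounted_return(reckless, gamma=1.0))  # -980.0
```

An agent that maximizes this quantity will prefer the first kind of episode, which is the behavior the reward structure was designed to encourage.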

Reinforcement learning is especially well suited to sequential decision-making problems that other kinds of machine learning handle poorly; when done correctly, it can lead to agents able to play the stock market, run procedures in a factory, and do a staggering variety of other tasks.

What are the Steps of Reinforcement Learning from Human Feedback?

Now that we know a little bit about reinforcement learning, let’s turn to a discussion of reinforcement learning from human feedback.

As we just described, reinforcement learning agents have to be trained like any other machine learning system. Under normal circumstances, this doesn’t involve any human feedback. A programmer will update the code, environment, or reward structure between training runs, but they don’t usually provide feedback directly to the agent.

Except, that is, in the case of reinforcement learning from human feedback, in which case that’s exactly what happens. A model will produce a set of outputs, and humans will rank them. Over time the model will adjust toward producing more and more appropriate responses, as judged by the human raters providing it with feedback.
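A ranking is often broken down into pairwise comparisons before it is used for training. The sketch below shows one simple way to do that in Python; the `ranking_to_pairs` helper and the sample responses are made up for illustration, and real pipelines may collect or process rankings differently.

```python
from itertools import combinations


def ranking_to_pairs(ranked_outputs):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs.

    combinations() keeps the input order, so the first item of each pair
    is always ranked above the second.
    """
    return list(combinations(ranked_outputs, 2))


# Hypothetical model outputs, ordered best to worst by a human rater.
ranked = [
    "Happy to help! Here is how to reset your password.",
    "Reset it in settings.",
    "Figure it out yourself.",
]

pairs = ranking_to_pairs(ranked)
for preferred, rejected in pairs:
    print(f"PREFERRED: {preferred!r}  OVER: {rejected!r}")
```

A ranking of three outputs yields three comparisons, and in general a ranking of K outputs yields K(K-1)/2 of them.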

Sometimes, this feedback can be for something relatively prosaic. It’s been used, for example, to get RL agents to execute backflips in simulated environments. The raters will look at short videos of two movements and select the one that looks like it’s getting closer to a backflip; with enough time, this gets the agent to actually do one.
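One common way to turn such pairwise choices into a training signal is to fit a reward model with the Bradley-Terry formulation, in which the probability that a rater prefers clip A over clip B is the sigmoid of the difference in their predicted rewards. The sketch below is a toy version under stated assumptions: the two features per clip (rotation and height), the sample comparisons, and the linear reward model are all hypothetical, and production systems use neural networks rather than a hand-rolled linear model.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reward(weights, features):
    """Linear reward model: a weighted sum of the clip's features."""
    return sum(w * f for w, f in zip(weights, features))


def train_reward_model(comparisons, n_features, lr=0.1, epochs=200):
    """Fit weights by gradient ascent on the Bradley-Terry log-likelihood.

    comparisons: list of (preferred_features, rejected_features) pairs.
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            # Probability the current model assigns to the human's choice.
            p = sigmoid(reward(weights, preferred) - reward(weights, rejected))
            # Gradient of log(p) w.r.t. the weights is (1 - p) * (preferred - rejected).
            for i in range(n_features):
                weights[i] += lr * (1.0 - p) * (preferred[i] - rejected[i])
    return weights


# Hypothetical features per clip: [amount of rotation, jump height].
# In each pair, the rater chose the first clip as closer to a backflip.
comparisons = [
    ([0.9, 0.2], [0.1, 0.3]),
    ([0.7, 0.5], [0.2, 0.5]),
    ([0.8, 0.1], [0.3, 0.4]),
]

weights = train_reward_model(comparisons, n_features=2)
print(weights)
```

Once trained, the reward model can score new clips on its own, so the RL agent can keep learning without a human rating every attempt.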

Or, it can be used for something more nuanced, such as getting a large language model to produce more conversational dialogue. This is part of how ChatGPT was trained.

Why is Reinforcement Learning from Human Feedback Necessary?

ChatGPT is already being used to great effect in contact centers and the customer service arena more broadly. Here are some example applications:

  • Question answering: ChatGPT is exceptionally good at answering questions. What’s more, some companies have begun fine-tuning it on their own internal and external documentation, so that people can directly ask it questions about how a product works or how to solve an issue. This obviates the need to go hunting around inside the docs.
  • Summarization: Similarly, ChatGPT can be used to summarize video transcripts, email threads, and lengthy articles so that agents (or customers) can get through the material at a much greater clip. This can, for example, help agents stay abreast of what’s going on in other parts of the company without burdening them with the need to read constantly. Quiq has custom-built tools for performing exactly this function.
  • Onboarding new hires: Together, question-answering and summarization are helping new contact center agents get up to speed much more quickly when they start their jobs.
  • Sentiment analysis: Sentiment analysis refers to classifying a text according to its sentiment, i.e. whether it’s “positive”, “negative”, or “neutral”. Sentiment analysis comes in several different flavors, including granular and aspect-based, and ChatGPT can help with all of them. Being able to automatically tag customer issues comes in handy when you’re trying to sort and prioritize them.
  • Real-time language translation: If your product or service has an international audience, then you might need to avail yourself of translation services so that agents and customers are speaking the same language. There are many such services available, but ChatGPT has proven to be at least as good as almost all of them.

In aggregate, these and other use cases of large language models are making contact center agents much more productive. But contact center agents have to interact with customers in a certain way – they have to be polite, helpful, etc.

And out of the box, most large language models do not behave that way. We’ve already had several high-profile incidents in which a language model, for example, asked a reporter to end his marriage or falsely accused a law school professor of sexual harassment.

Reinforcement learning from human feedback is currently the most promising approach for tuning this toxic and harmful behavior out of large language models. The only reason they’re able to help contact center agents so much is that they’ve been fine-tuned with such an approach; otherwise, agents would be spending an inordinate amount of time rephrasing and tinkering with a model’s output to get it to be appropriately friendly.

This is why reinforcement learning from human feedback is important for the managers of contact centers to understand – it’s a major part of why large language models are so useful in the first place.

Applications of Reinforcement Learning from Human Feedback

To round out our picture, we’re going to discuss a few ways in which reinforcement learning from human feedback is actually used in the wild. We’ve already discussed how it is used to fine-tune models to be more helpful in the context of a contact center, and we’ll now talk a bit about how it’s used in gaming and robotics.

Using Reinforcement Learning from Human Feedback in Games

Gaming has long been one of the ideal testing grounds for new approaches to artificial intelligence. As you might expect, it’s also a place where reinforcement learning from human feedback has been successfully applied.

OpenAI used it to achieve superhuman performance on a classic Atari game, Enduro. Enduro is an old-school racing game, and like all racing games, the point is to gradually pass the other cars without hitting them or going out of bounds in the game.

It’s exceptionally difficult for an agent to learn to play Enduro using only standard reinforcement learning approaches. But when human feedback is added, the results shift dramatically.

Using Reinforcement Learning from Human Feedback in Robotics

Because robotics almost always involves an agent interacting with the physical world, it’s especially well-suited to reinforcement learning from human feedback.

Often, it can be difficult to get a robot to execute a long series of steps that achieves a valuable reward, especially when the intermediate steps aren’t themselves very valuable. What’s more, it can be especially difficult to build a reward structure that correctly incentivizes the agent to execute the intermediate steps in the right order.

It’s much simpler instead to have humans look at sequences of actions and judge for themselves which will get the agent closer to its ultimate goal.

RLHF For The Contact Center Manager

Having made it this far, you should be in a much better position to understand how reinforcement learning from human feedback works, and how it contributes to the functioning of your contact centers.

If you’ve been thinking about leveraging AI to make yourself or your agents more effective, set up a demo with the Quiq team to see how we can put our cutting-edge models to work for you. We offer both customer-facing and agent-facing tools, all of them designed to help you make customers happier while reducing agent burnout and turnover.


Author

  • J.R. Rettenmeyer

    JR Rettenmyer is the Principal Applied AI Architect at Quiq. In his previous role at Snaps, an enterprise conversational AI company acquired by Quiq, JR held titles of both VP of Software Engineering and, later on, SVP of Product Development. JR is a curious lifelong learner who leverages his background in product management, software engineering, and AI to develop strategies and solutions for our customers.
