How to Measure the Quality of Generated Text and Evaluate Model Performance

Machine learning is an incredibly powerful technology. That’s why it’s being used in everything from autonomous vehicles to medical diagnoses to the sophisticated, dynamic AI Assistants that are handling customer interactions in modern contact centers.

But for all this, it isn’t magic. The engineers who build these systems must know a great deal about how to evaluate them. How do you know when a model is performing as expected, or when it has begun to overfit the data? How can you tell when one model is better than another?

This subject will be our focus today. We’ll cover the basics of evaluating a machine learning model with metrics like mean squared error and accuracy, then turn our attention to the more specialized task of evaluating the generated text of a large language model like ChatGPT.

How Do You Measure the Performance of a Machine Learning Model?

A machine learning model is always aimed at some task. It might be trying to fit a regression line that helps predict the future price of Bitcoin, it might be clustering documents according to their topics, or it might be trying to generate text so good it rivals that produced by humans.

How does the model know when it’s gotten the optimal line or discovered the best way to cluster documents? (And more importantly, how do you know?)

In the next few sections, we’ll talk about a few common ways of evaluating the performance of a machine-learning model. If you’re an engineer, this will help you create better models yourself; if you’re a layperson, it’ll help you better understand how the machine-learning pipeline works.

Evaluation Metrics for Regression Models

Regression is one of the two big types of basic machine learning, with the other being classification.

In tech-speak, we say that the purpose of a regression model is to learn a function that maps a set of input features to a real value (where “real” just means “real numbers”). This is not as scary as it sounds; you might try to create a regression model that predicts the number of sales you can expect given that you’ve spent a certain amount on advertising, or you might try to predict how long a person will live on the basis of their daily exercise, water intake, and diet.

In each case, you’ve got a set of input features (advertising spend or daily habits), and you’re trying to predict a target variable (sales, life expectancy).

The relationship between the two is captured by a model, and a model’s quality is evaluated with a metric. Popular metrics for regression models include the mean squared error, the root mean squared error, and the mean absolute error (though there are plenty of others if you feel like going down a nerdy rabbit hole).

The mean squared error (MSE) quantifies how good a regression model is by calculating the difference between each prediction and the corresponding real data point, squaring those differences (so that positive and negative errors don’t cancel out), and then averaging them. This gives a single number that the training algorithm can use to adjust the model: if the MSE is going down, the model is getting better; if it’s going up, it’s getting worse.

The root mean squared error (RMSE) does the same thing, with one extra step: you take the square root of the MSE. The big advantage is that this converts the metric back into the units of your problem (e.g. the “squared dollars” of MSE become plain “dollars” again), which makes it much easier to interpret.

The mean absolute error (MAE) is the same basic idea, but it uses absolute values instead of squares. If you’ve got an outlier data point that’s far away from your model’s prediction, squaring that difference produces a much bigger error than simply taking its absolute value, so the MAE is less sensitive to outliers in the dataset than the MSE or RMSE.
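
To make this concrete, here’s a minimal sketch of computing all three metrics by hand with NumPy. The sales figures are invented purely for illustration; in practice you could also lean on scikit-learn’s mean_squared_error and mean_absolute_error helpers.

    import numpy as np

    # Hypothetical actual sales vs. the model's predictions (illustrative numbers only)
    y_true = np.array([120.0, 150.0, 90.0, 200.0, 175.0])
    y_pred = np.array([115.0, 160.0, 100.0, 190.0, 170.0])

    errors = y_true - y_pred

    mse = np.mean(errors ** 2)       # mean squared error ("squared dollars")
    rmse = np.sqrt(mse)              # root mean squared error (back to plain "dollars")
    mae = np.mean(np.abs(errors))    # mean absolute error (less sensitive to outliers)

    print(f"MSE:  {mse:.2f}")
    print(f"RMSE: {rmse:.2f}")
    print(f"MAE:  {mae:.2f}")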

Evaluation Metrics for Classification Models

People tend to struggle less with understanding classification models because the idea is more intuitive: you’re building something that can take a data point (say, the price of an item) and sort it into one of a number of different categories (e.g. “cheap”, “somewhat expensive”, “expensive”, “very expensive”).

Of course, the categories you choose will depend on the problem you’re trying to solve and the domain you’re operating in – a $100 apple is certainly “very expensive”, but a $100 wedding ring…will probably get you left at the altar.

Regardless, it’s just as essential to evaluate the performance of a classification model as it is to evaluate the performance of a regression model. Some common evaluation metrics for classification models are accuracy, precision, and recall.

Accuracy is simple, and it’s exactly what it sounds like. You find the accuracy of a classification model by dividing the number of correct predictions it made by the total number of predictions it made. If your classification model made 1,000 predictions and got 941 of them right, that’s an accuracy rate of 94.1% (not bad!).
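
That calculation is trivial, but for completeness, here’s the same arithmetic as a tiny Python snippet (using the hypothetical counts from above):

    correct_predictions = 941
    total_predictions = 1_000

    accuracy = correct_predictions / total_predictions
    print(f"Accuracy: {accuracy:.1%}")   # -> Accuracy: 94.1%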

Both precision and recall are subtler variants of this same idea. The precision is the number of true positives (correct positive classifications) divided by the sum of true positives and false positives (incorrect positive classifications). It says, in effect, “When your model thought it had identified a needle in a haystack, this is how often it was correct.”

The recall is the number of true positives divided by the sum of true positives and false negatives (incorrect negative classifications). It says, in effect, “There were 200 needles in this haystack, and your model found 72% of them.”
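
Continuing the needle-in-a-haystack example, here’s a minimal sketch of both metrics computed from raw counts. The numbers are invented to line up with the 72% recall above (144 of 200 needles found), and the false-positive count is likewise hypothetical.

    true_positives = 144    # needles the model correctly flagged
    false_positives = 36    # hay the model mistakenly flagged as needles
    false_negatives = 56    # needles the model missed (200 needles in total)

    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)

    print(f"Precision: {precision:.0%}")   # -> 80%: when it said "needle", it was right 80% of the time
    print(f"Recall:    {recall:.0%}")      # -> 72%: it found 72% of the 200 needles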

Accuracy tells you how well your model performed overall, precision tells you how confident you can be in its positive classifications, and recall tells you how many of the actual positives it managed to find.

(You may be wondering if this isn’t overkill. Do we really need all these different ratios? Answering that question fully would take us too far from our purpose of measuring the quality of text from generative AI models, but suffice it to say that there are trade-offs involved. Sometimes it makes more sense to focus on boosting the precision, other times getting a higher recall is more important. These are all just different tools for figuring out how to spend your limited time and energy to get a model that best solves your problem.)


How Can I Assess the Performance of a Generative AI Model?

Now, we arrive at the heart of this article. Everything up to now has been background meant to give you a feel for how models are evaluated, because from here on out things get a bit more abstract.

Using Reference Text for Evaluating Generative Models

When we wanted to evaluate a regression model, we started by looking at how far its predictions were from actual data points.

Well, we do essentially the same thing with generative language models. To assess the quality of text generated by a model, we’ll compare it against high-quality text that’s been selected by domain experts.

The Bilingual Evaluation Understudy (BLEU) Score

The BLEU score quantifies how closely the generated text matches the reference text. It does this by comparing the amount of n-gram [1] overlap between the two using a series of weighted precision scores.

The BLEU score varies from 0 to 1. A score of “0” indicates that there is no n-gram overlap between the generated and reference text, and the model’s output is considered to be of low quality. A score of “1”, conversely, indicates that there is total overlap between the generated and reference text, and the model’s output is considered to be of high quality.
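
Here’s a minimal sketch of a sentence-level BLEU calculation, assuming NLTK is installed. The tokenized sentences are invented for illustration, and smoothing is applied so the score doesn’t collapse to zero when a higher-order n-gram has no match.

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    # One (or more) tokenized reference texts, plus the model's candidate output
    references = [["the", "cat", "sat", "on", "the", "mat"]]
    candidate = ["the", "cat", "is", "on", "the", "mat"]

    smoothing = SmoothingFunction().method1
    score = sentence_bleu(references, candidate, smoothing_function=smoothing)

    print(f"BLEU: {score:.3f}")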

Comparing BLEU scores across different sets of reference texts or different natural languages is so tricky that it’s considered best to avoid it altogether.

Also, be aware that the BLEU score contains a “brevity penalty,” which discourages the model from being too concise. If the model’s output is much shorter than the reference text, this counts as a strike against it.
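
For the curious, the standard brevity penalty is a simple exponential in the ratio of reference length to candidate length; here’s a rough sketch:

    import math

    def brevity_penalty(candidate_len: int, reference_len: int) -> float:
        # No penalty when the candidate is at least as long as the reference
        if candidate_len >= reference_len:
            return 1.0
        # Otherwise the penalty shrinks exponentially as the candidate gets shorter
        return math.exp(1 - reference_len / candidate_len)

    print(brevity_penalty(candidate_len=9, reference_len=12))   # shorter candidate -> penalty below 1

This factor multiplies the weighted precision scores, so an overly short output drags the final BLEU score down.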

The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) Score

Like the BLEU score, the ROUGE score examines the n-gram overlap between an output text and a reference text. Unlike the BLEU score, however, it emphasizes recall instead of precision.

There are three main types of ROUGE scores (a rough sketch of the ROUGE-N calculation follows the list below):

  1. ROUGE-N: ROUGE-N is the most common type of ROUGE score, and it simply looks at n-gram overlap, as described above.
  2. ROUGE-L: ROUGE-L looks at the “Longest Common Subsequence” (LCS), or the longest chain of tokens that the reference and output text share. The longer the LCS, of course, the more the two have in common.
  3. ROUGE-S: This is the least commonly used variant of the ROUGE score, but it’s worth hearing about. ROUGE-S concentrates on the “skip-grams” [2] that the two texts have in common. ROUGE-S would count “He bought the house” and “He bought the blue house” as overlapping heavily because they contain the same words in the same order, even though the second sentence has an additional adjective.
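
As a rough illustration of ROUGE-N, here’s a hand-rolled sketch of the recall calculation (overlapping n-grams divided by the n-grams in the reference). In practice you’d likely reach for an existing library such as the rouge-score package rather than this toy version.

    from collections import Counter

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def rouge_n_recall(reference, candidate, n=1):
        ref_counts = Counter(ngrams(reference, n))
        cand_counts = Counter(ngrams(candidate, n))
        # Clipped overlap: a reference n-gram only counts as matched as often as it appears
        overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
        return overlap / max(sum(ref_counts.values()), 1)

    reference = "the cat sat on the mat".split()
    candidate = "the cat is on the mat".split()

    print(f"ROUGE-1 recall: {rouge_n_recall(reference, candidate, n=1):.2f}")   # -> 0.83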

The Metric for Evaluation of Translation with Explicit Ordering (METEOR) Score

The METEOR score takes the harmonic mean of the precision and recall scores for 1-gram overlap between the output and reference text. It puts more weight on recall than on precision, and it’s intended to address some of the deficiencies of the BLEU and ROUGE scores while staying reasonably close to how expert humans assess the quality of model-generated output.
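
To show the core idea, here’s a rough sketch of the recall-weighted harmonic mean at the heart of METEOR (the original formulation weights recall nine times as heavily as precision). Note that this leaves out METEOR’s stemming, synonym matching, and fragmentation penalty.

    def meteor_f_mean(precision: float, recall: float) -> float:
        # Harmonic mean of unigram precision and recall, weighted heavily toward recall
        if precision == 0 or recall == 0:
            return 0.0
        return (10 * precision * recall) / (recall + 9 * precision)

    print(f"{meteor_f_mean(precision=0.8, recall=0.6):.3f}")   # weighting pulls the score toward the recall of 0.6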

BERT Score

At this point, it may have occurred to you to wonder whether the BLEU and ROUGE scores are actually doing a good job of evaluating the performance of a generative language model. They look at exact n-gram overlaps, and most of the time we don’t really need the model’s output to match the reference text word for word – it needs to be at least as good, without having to be identical.

The BERT score is meant to address this concern through contextual embeddings. By comparing the embeddings behind the sentences rather than the surface tokens, the BERT score is able to see that “He quickly ate the treats” and “He rapidly consumed the goodies” are expressing basically the same idea, while both the BLEU and ROUGE scores would completely miss this.
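
Here’s a minimal sketch of what that looks like in practice, assuming the bert-score Python package; the example sentences are the ones from the paragraph above.

    from bert_score import score

    candidates = ["He rapidly consumed the goodies."]
    references = ["He quickly ate the treats."]

    # Precision, recall, and F1 computed from contextual token embeddings
    P, R, F1 = score(candidates, references, lang="en")

    print(f"BERTScore F1: {F1.mean().item():.3f}")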

Final Thoughts

We’ve all seen what generative AI can do, and it’s fair at this point to assume this technology is going to become more prevalent in fields like software engineering, customer service, customer experience, and marketing.

But, as magical as generative AI might seem, these systems are just models. They have to be evaluated and monitored like any others, or you risk having a bad one negatively impact your brand.

If you’re enchanted by the potential of using generative algorithms in your contact center but are daunted by the challenge of putting together an engineering team, reach out to us for a demo of the Quiq conversational CX platform. We can help you put this cutting-edge technology to work without having to worry about all the finer details and resourcing issues.

***

Footnotes

[1] An n-gram is just a contiguous sequence of n items – usually words, though it can also be characters or whole sentences. A 1-gram is typically a single word, a 2-gram two words, and so on.
[2] Skip-grams are a rather involved subdomain of natural language processing, and most of the details are beyond the scope of this article. All you need to know is that the ROUGE-S score is set up to be less concerned with exact n-gram overlaps than the alternatives.
