Explainability vs. Interpretability in Machine Learning Models

In recent months, we’ve produced a tremendous amount of content about generative AI – from high-level primers on what large language models are and how they work, to discussions of how they’re transforming contact centers, to deep dives on the cutting edge of generative technologies.

This amounts to thousands of words, much of it describing how models like ChatGPT are trained, for example by having them repeatedly predict the next word (more precisely, the next token) of a passage given the text that precedes it.

But for all that, there’s still a tremendous amount of uncertainty about the inner workings of advanced machine-learning systems. Even the people who build them generally don’t understand how particular functions emerge or what a particular circuit is doing.

It would be more accurate to describe these systems as having been grown, like an inconceivably complex garden. And just as you might have questions if your tomatoes started spitting out math proofs, it’s natural to wonder why generative models are behaving in the way that they are.

These questions are only going to become more important as these technologies are further integrated into contact centers, schools, law firms, medical clinics, and the economy in general.

If we use machine learning to decide who gets a loan, who is likely to have committed a crime, or to have open-ended conversations with our customers, it really matters that we know how all this works.

The two big approaches to this task are explainability and interpretability.

Comparing Explainability and Interpretability

Under normal conditions, this section would come at the very end of the article, after we’d gone through definitions of both these terms and illustrated how they work with copious examples.

We’re electing to include it at the beginning for a reason: while the machine-learning community does broadly agree on what these two terms mean, there’s a lot of confusion about which bucket different techniques go into.

Below, for example, we’ll discuss Shapley Additive Explanations (SHAP). Some sources file this as an approach to explainability, while others consider it a way of making a model more interpretable.

A major contributing factor to this overlap is the simple fact that the two concepts are very closely related. Once you can explain a fact you can probably interpret it, and a big part of interpretation is explanation.

Below, we’ve done our best to make sense of these important research areas and to lay everything out in a way that will help you understand what’s going on.

With that caveat out of the way, let’s define explainability and interpretability.

Broadly, explainability means analyzing the behavior of a model to understand why a given course of action was taken. If you want to know why data point “a” was sorted into one category while data point “b” was sorted into another, you’d probably turn to one of the explainability techniques described below.

Interpretability means making features of a model, such as its weights or coefficients, comprehensible to humans. Linear regression models, for example, calculate sums of weighted input features, and interpretability would help you understand what exactly that means.
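To make the “sums of weighted input features” idea concrete, here is a minimal sketch. The intercept, coefficients, and feature names are hypothetical values chosen for illustration, not the output of a real fitted model.

```python
# Hypothetical coefficients for a toy house-price model (illustrative values only).
intercept = 50_000.0
weights = {"square_feet": 150.0, "age_years": -1_000.0, "bedrooms": 10_000.0}

def predict(features):
    """Linear regression prediction: intercept plus a weighted sum of the inputs."""
    return intercept + sum(weights[name] * value for name, value in features.items())

def contributions(features):
    """How much each feature adds to (or subtracts from) this one prediction."""
    return {name: weights[name] * value for name, value in features.items()}

house = {"square_feet": 2_000.0, "age_years": 30.0, "bedrooms": 3.0}
price = predict(house)        # 350000.0
parts = contributions(house)  # e.g. age_years contributes -30000.0
```

Because the prediction is just a sum, each coefficient reads as “how much the output moves when this feature increases by one unit, holding the others fixed,” which is the kind of statement interpretability is after.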

Here’s an analogy that might help: you probably know at least a little about how a train works. Understanding that it needs fuel to move, has to have tracks constructed a certain way to avoid crashing, and needs brakes in order to stop would all contribute to the interpretability of the train system.

But knowing which kind of fuel it requires and for what reason, why the tracks must be made out of a certain kind of material, and how exactly pulling a brake switch actually gets the train to stop are all facets of the explainability of the train system.

What is Explainability in Machine Learning?

In machine learning, explainability refers to any set of techniques that allow you to reason about the nuts and bolts of the underlying model. If you can at least vaguely follow how data are processed and how they impact the final model output, the system is explainable to that degree.

Before we turn to the techniques utilized in machine learning explainability, let’s talk at a philosophical level about the different types of explanations you might be looking for.

Different Types of Explanations

There are many approaches you might take to explain an opaque machine-learning model. Here are a few:

  • Explanations by text: One of the simplest ways of explaining a model is by reasoning about it with natural language. The better sorts of natural-language explanations will, of course, draw on some of the explainability techniques described below. You can also try to talk about a system logically, for example by describing it as calculating logical AND, OR, and NOT operations.
  • Explanations by visualization: For many kinds of models, visualization will help tremendously in increasing explainability. Support vector machines, for example, use a decision boundary to sort data points and this boundary can sometimes be visualized. For extremely complex datasets this may not be appropriate, but it’s usually worth at least trying.
  • Local explanations: There are whole classes of explanation techniques, like LIME, that operate by illustrating how a black-box model works in some particular region. In other words, rather than trying to parse the whole structure of a neural network, we zoom in on one part of it and say “This is what it’s doing right here.”

Approaches to Explainability in Machine Learning

Now that we’ve discussed the varieties of explanation, let’s get into the nitty-gritty of how explainability in machine learning works. There are a number of different explainability techniques, but we’re going to focus on two of the biggest: SHAP and LIME.

Shapley Additive Explanations (SHAP) are derived from cooperative game theory and are a commonly used way of making models more explainable. The basic idea is that you’re trying to parcel out “credit” for the model’s outputs among its input features. In game theory, potential players can choose to enter a game, or not, and this is the first idea that is ported over to SHAP: each feature is treated as a player that is either present or absent.

SHAP “values” are generally calculated by looking at how a model’s output changes based on different combinations of features. If a model has, say, 10 input features, you could look at the model’s output when only four of them are used, then see how that output changes when you add a fifth.

By running this procedure for many different feature sets, you can understand how any given feature contributes to the model’s overall predictions.
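The procedure above can be sketched in a few lines for a model small enough to enumerate every feature subset. This is an exact Shapley value calculation, not the API of the `shap` library, and `toy_model` and its feature names are hypothetical. Real SHAP implementations also have to decide what a “missing” feature means (typically by averaging over background data) and usually approximate rather than enumerate.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, features):
    """Exact Shapley values: each feature's weighted average marginal
    contribution over all subsets of the other features."""
    n = len(features)
    result = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            # Weight for subsets of size k that do not contain f.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of adding f to this subset.
                total += weight * (value_fn(s | {f}) - value_fn(s))
        result[f] = total
    return result

def toy_model(present):
    """Hypothetical model output when only the features in `present` are used."""
    score = 0.0
    if "income" in present:
        score += 10.0
    if "debt" in present:
        score -= 4.0
    if "income" in present and "age" in present:
        score += 6.0  # interaction: age only matters alongside income
    return score

phi = shapley_values(toy_model, ["income", "debt", "age"])
```

In this toy case the interaction bonus is split evenly between `income` and `age`, and the three values add up to the difference between the model’s output with every feature present and its output with none, which is the “parceling out credit” property described above.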

Local Interpretable Model-agnostic Explanations (LIME) is based on the idea that our best bet in understanding a complex model is to first narrow our focus to one part of it, then study a simpler model that captures its local behavior.

Let’s work through an example. Imagine that you’ve taken an enormous amount of housing data and fit a complex random forest model that’s able to predict the price of a house based on features like how old it is, how close it is to neighbors, etc.

LIME lets you figure out what the random forest is doing in a particular region, so you’d start by selecting one row of the data frame, which would contain the input features for a single house. Then, you would “perturb” this sample, which means that for each of its features you’d sample from a distribution around that data point to create a new, perturbed dataset of similar-looking houses.

You would feed this perturbed dataset into your random forest model and get a new set of perturbed predictions. On the perturbed inputs and their predictions, you’d then train a simple model, like a linear regression, typically giving more weight to the samples that are closest to the original house.

Linear regression is almost never as flexible and powerful as a random forest, but it does have one advantage: it comes with a bunch of coefficients that are fairly easy to interpret.

This LIME approach won’t tell you what the model is doing everywhere, but it will give you an idea of how the model is behaving in one particular place. If you run LIME at several different points, you can start to form a picture of how the model is functioning overall.
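Here is a rough, standard-library-only sketch of those steps. It is not the API of the `lime` package: `black_box` is a hypothetical stand-in for the random forest, the two features and their scales are made up, and the proximity kernel is just one reasonable choice among many.

```python
import math
import random

def black_box(age, distance):
    """Hypothetical stand-in for a complex model; nonlinear in both inputs."""
    return 300.0 - 0.02 * age ** 2 + 20.0 * math.tanh(distance / 4.0)

def solve(a, b):
    """Solve the small linear system a @ x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def lime_sketch(point, scales, n_samples=2000, seed=0):
    """Fit a proximity-weighted linear model to the black box around `point`."""
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        # Perturb only the input features, in units of each feature's scale.
        z = [rng.gauss(0.0, 1.0) for _ in point]
        x = [p + s * zi for p, s, zi in zip(point, scales, z)]
        rows.append([1.0] + z)         # intercept term + standardized offsets
        targets.append(black_box(*x))  # the black box supplies the labels
        # Samples closer to the original point get more weight.
        weights.append(math.exp(-sum(zi * zi for zi in z) / 2.0))
    k = len(point) + 1
    # Weighted normal equations: (X^T W X) beta = X^T W y
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights))
             for j in range(k)] for i in range(k)]
    xtwy = [sum(w * r[i] * t for r, t, w in zip(rows, targets, weights))
            for i in range(k)]
    beta = solve(xtwx, xtwy)
    # Convert slopes back to "output change per unit of the original feature".
    return [b / s for b, s in zip(beta[1:], scales)]

slopes = lime_sketch(point=[30.0, 4.0], scales=[2.0, 0.5])
```

The two returned slopes are the local linear model’s coefficients: around this particular house, they approximate how much the black box’s prediction changes per extra year of age and per extra unit of distance. Calling `lime_sketch` at a different `point` would generally give different slopes, which is exactly why the explanation is only local.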

What is Interpretability in Machine Learning?

In machine learning, interpretability refers to a set of approaches that shed light on a model’s internal workings.

SHAP, LIME, and other explainability techniques can also be used for interpretability work. Rather than go over territory we’ve already covered, we’re going to spend this section focusing on an exciting new field of interpretability, called “mechanistic” interpretability.

Mechanistic Interpretability: A New Frontier

Mechanistic interpretability is defined as “the study of reverse-engineering neural networks”. Rather than examining subsets of input features to see how they impact a model’s output (as we do with SHAP) or training a more interpretable local model (as we do with LIME), mechanistic interpretability involves going directly for the goal of understanding what a trained neural network is really, truly doing.

It’s a very young field that so far has only tackled networks like GPT-2 (no one has yet figured out how GPT-4 functions), but already its results are remarkable. If it continues to progress, it could allow us to discover the actual algorithms being learned by large language models, which would give us a way to check them for bias and deceit, understand what they’re really capable of, and figure out how to make them even better.
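To give a flavor of what “reverse-engineering” means, here is a deliberately tiny sketch. The network below has hypothetical, hand-set weights rather than trained ones, and the approach (probing each hidden unit on every possible input and matching it against known logical functions) only works because the example is so small. Real mechanistic interpretability work on language models is vastly harder.

```python
from itertools import product

def step(x):
    """Threshold activation: fire (1) if the input is positive, else 0."""
    return 1 if x > 0 else 0

# Hypothetical, hand-set weights standing in for a "trained" network.
HIDDEN = [((1.0, 1.0), -0.5),   # hidden unit 0: (weights, bias)
          ((1.0, 1.0), -1.5)]   # hidden unit 1: (weights, bias)
OUTPUT = ((1.0, -2.0), -0.5)    # output unit: (weights over hidden units, bias)

def hidden_activations(a, b):
    return [step(w[0] * a + w[1] * b + bias) for w, bias in HIDDEN]

def network(a, b):
    h = hidden_activations(a, b)
    w, bias = OUTPUT
    return step(w[0] * h[0] + w[1] * h[1] + bias)

CANDIDATES = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}

def identify(fn):
    """Name the candidate logical function that matches fn on every input, if any."""
    inputs = list(product([0, 1], repeat=2))
    for name, candidate in CANDIDATES.items():
        if all(fn(a, b) == candidate(a, b) for a, b in inputs):
            return name
    return None

# Probe each hidden unit, then the network as a whole.
unit_roles = [identify(lambda a, b, i=i: hidden_activations(a, b)[i]) for i in range(2)]
whole_network = identify(network)
```

Here the probe finds that one hidden unit behaves like OR, the other like AND, and the network as a whole computes XOR (“OR, but not AND”). That is a mechanistic account of the algorithm, as opposed to a description of which inputs matter.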

Why are Interpretability and Explainability Important?

Interpretability and explainability are both very important areas of ongoing research. Not so long ago (less than twenty years), neural networks were interesting systems that weren’t able to do a whole lot.

Today, they are recommending news and entertainment, driving cars, trading stocks, generating reams of content, and making decisions that can permanently affect people’s lives.

These technologies are having a huge and growing impact, and it’s no longer enough for us to have a fuzzy, high-level idea of what they’re doing.

We now know that they work, and with techniques like SHAP, LIME, mechanistic interpretability, etc., we can start to figure out why they work.

Final Thoughts on Interpretability vs. Explainability

In contact centers and elsewhere, large language models are changing the game. But though their power is evident, they remain a predominantly empirical triumph.

The inner workings of large language models remain a mystery, one that has only recently begun to be unraveled through techniques like the ones we’ve discussed in this article.

Though it’s probably asking too much to expect contact center managers to become experts in machine learning interpretability or explainability, hopefully, this information will help you make good decisions about how you want to utilize generative AI.

And speaking of good decisions, if you do decide to move forward with deploying a large language model in your contact center, consider doing it through one of the most trusted names in conversational AI. In recent weeks, the Quiq platform has added several tools aimed at making your agents more efficient and your customers happier.

Set up a demo today to see how we can help you!
