
Current Large Language Models and How They Compare


From ChatGPT and Bard to BLOOM and Claude, there is a veritable ocean of current LLMs (large language models) for you to choose from. Some are specialized for specific use cases, some are open-source, and there’s a huge variance in the number of parameters they contain.

If you’re a CX leader and find yourself fascinated by the potential of using this technology in your contact center, it can be hard to know how to run proper LLM comparisons.

Today, we’re going to tackle this issue head-on by talking about specific criteria you can use to compare LLMs, sources of additional information, and some of the better-known options.

But always remember that the point of using an LLM is to deliver a world-class customer experience, and the best option is usually the one that delivers multi-model functionality with a minimum of technical overhead.

With that in mind, let’s get started!

What is Generative AI?

While it may seem like large language models (LLMs) and generative AI have only recently emerged, the work they’re based on goes back decades. The journey began in the 1940s with Walter Pitts and Warren McCulloch, who designed artificial neurons based on early brain research. However, practical applications became feasible only after the development of the backpropagation algorithm in 1985, which enabled effective training of larger neural networks.

By 1989, researchers had developed a convolutional system capable of recognizing handwritten numbers. Innovations such as long short-term memory networks further enhanced machine learning capabilities during this period, setting the stage for more complex applications.

The 2000s ushered in the era of big data, crucial for training generative pre-trained models like ChatGPT. This combination of decades of foundational research and vast datasets culminated in the sophisticated generative AI and current LLMs we see transforming contact centers and related industries today.

What’s the Best Way to do a Large Language Models Comparison?

If you’re shopping around for a current LLM for a particular application, it makes sense to first clarify the evaluation criteria you should be using. We’ll cover that in the sections below.

Large Language Models Comparison By Industry Use Case

One of the more remarkable aspects of current LLMs is that they’re good at so many things. Out of the box, most can do very well at answering questions, summarizing text, translating between natural languages, and much more.

But there might be situations in which you’d want to boost the performance of one of the current LLMs on certain tasks. The two most popular ways of doing this are retrieval-augmented generation (RAG) and fine-tuning a pre-trained model.

Here’s a quick recap of what both of these are:

  • Retrieval-augmented generation refers to getting one of the general-purpose, current LLMs to perform better by giving it access to additional resources it can use to improve its outputs. You might hook it up to a contact-center CRM so that it can provide specific details about orders, for example (see the sketch after this list).
  • Fine-tuning refers to taking a pre-trained model and honing it for specific tasks by continuing its training on data related to that task. A generic model might be shown hundreds of polite interactions between customers and CX agents, for example, so that it’s more courteous and helpful.
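
To make the RAG idea concrete, here’s a minimal sketch of the pattern as it might look in a contact center. The search_orders helper is hypothetical (standing in for a real CRM lookup), and the OpenAI Python client is used only as one example backend; any comparable LLM API would slot in the same way.

```python
from openai import OpenAI  # one possible backend; any comparable LLM API works similarly

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def search_orders(customer_id: str) -> str:
    """Hypothetical CRM lookup; in practice this would query your order system."""
    return "Order #1234: 2x coffee mugs, shipped May 1, estimated delivery May 5."

def answer_with_rag(customer_id: str, question: str) -> str:
    # Retrieve: pull the customer-specific facts the base model can't know on its own.
    context = search_orders(customer_id)

    # Augment and generate: hand those facts to the model alongside the question.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You are a courteous CX agent. Answer only from the "
                           "order details provided.",
            },
            {"role": "user", "content": f"Order details:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

print(answer_with_rag("cust-42", "When will my mugs arrive?"))
```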

So, if you’re considering using one of the current LLMs in your business, there are a few questions you should ask yourself. First, are any of them perfectly adequate as-is? If they’re not, the next question is how “adaptable” they are. It’s possible to use RAG or fine-tuning with most of the current LLMs; the question is how easy they make it.

Of course, by far the easiest option would be to leverage a model-agnostic conversational AI platform for CX. These can switch seamlessly between different models, and some support RAG out of the box, meaning you aren’t locked into one current LLM and can always reach for the right tool when needed.

What’s a Good Way To Think About an Open-Source or Closed-Source Large Language Models Comparison?

You’ve probably heard of “open-source,” which refers to the practice of releasing source code to the public so that it can be forked, modified, and scrutinized.

The open-source approach has become incredibly popular, and this enthusiasm has partially bled over into artificial intelligence and machine learning. It is now fairly common to open-source software, datasets, and training frameworks like TensorFlow.

How does this translate to the realm of large language models? In truth, it’s a bit of a mixture. Some models are proudly open-sourced, while the teams behind others jealously guard their weights, training data, and source code.

This is one thing you might want to consider as you carry out your LLM comparisons. Some of the very best models, like ChatGPT, are closed-source. The downside of using such a model is that you’re entirely beholden to the team that built it. If they make updates or go bankrupt, you could be left scrambling at the last minute to find an alternative solution.

There’s no one-size-fits-all approach here, but it’s worth pointing out that a high-quality enterprise solution will support customization by allowing you to choose between different models (both closed-source and open-source). This way, you needn’t concern yourself with forking repos or fretting over looming updates; you can just use whichever model performs best for your particular application.

Getting A Large Language Models Comparison Through Leaderboards and Websites

Instead of doing your LLM comparisons yourself, you could avail yourself of a service built for this purpose.

Whatever rumors you may have heard, programmers are human beings, and human beings have a fondness for ranking and categorizing pretty much everything – sports teams, guitar solos, classic video games, you name it.

Naturally, as current LLMs have become better known, leaderboards and websites have popped up comparing them along all sorts of different dimensions. Here are a few you can use as you search around for the best current LLMs.

Leaderboards for Comparing LLMs

In the past couple of months, leaderboards have emerged which directly compare various current LLMs.

One is AlpacaEval, which uses a custom dataset to compare ChatGPT, Claude, Cohere, and other LLMs on how well they can follow instructions. AlpacaEval boasts high agreement with human evaluators, so in our estimation, it’s probably a suitable way of initially comparing LLMs, though more extensive checks might be required to settle on a final list.

Another good choice is Chatbot Arena, which pits two anonymous models side-by-side, has you rank which one is better, then aggregates all the scores into a leaderboard.

Finally, there is Hugging Face’s Open LLM Leaderboard, which is similar. Anyone can submit a new model for evaluation, which is then assessed based on a small set of key benchmarks from the Eleuther AI Language Model Evaluation Harness. These capture how well the models do in answering simple science questions, common-sense queries, and more, which will be of interest to CX leaders.

When combined with the criteria we discussed earlier, these leaderboards and comparison websites ought to give you everything you need to execute a constructive large language models comparison.

What are the Currently-Available Large Language Models?

Okay! Now that we’ve worked through all this background material, let’s turn to discussing some of the major LLMs that are available today. We make no promises about these entries being comprehensive (and even if they were, there’d be new models out next week), but they should be sufficient to give you an idea as to the range of options you have.

ChatGPT and GPT

Obviously, the titan in the field is OpenAI’s ChatGPT, which is really just a version of GPT that has been fine-tuned through reinforcement learning from human feedback to be especially good at sustained dialogue.

ChatGPT and GPT have been used in many domains, including customer service, question answering, and many others. As of this writing, the most recent GPT is version 4o (note: that’s the letter ‘o’, not the number ‘0’).

LLaMA

In April 2024, Meta’s AI team released version three of its Large Language Model Meta AI (LLaMA 3). At 70 billion parameters in its larger initial variant, it is not as big as GPT-4 is generally reported to be; this is intentional, as its purpose is to aid researchers who may not have the budget or expertise required to provision a behemoth LLM.

Gemini

Like GPT-4, Google’s Gemini is aimed squarely at dialogue. It is able to converse on a nearly limitless range of subjects, and from the beginning, the Google team has focused on having Gemini produce interesting responses that are nevertheless free of abuse and harmful language.

StableLM

StableLM is a lightweight, open-source language model built by Stability AI. It’s trained on an experimental dataset built on “The Pile”, which is itself made up of over 20 smaller, high-quality datasets that together amount to over 825 GB of natural language.

GPT4All

What would you get if you trained an LLM “…on a massive curated corpus of assistant interactions, which included word problems, multi-turn dialogue, code, poems, songs, and stories,” and then released it under an Apache 2.0 license? The answer is GPT4All, an open-source model whose purpose is to encourage research into what these technologies can accomplish.

BLOOM

The BigScience Large Open-Science Open-Access Multilingual Language Model (BLOOM) was released in late 2022. The team that put it together consisted of more than a thousand researchers from all over the world, and unlike the other models on this list, it’s specifically meant to be interpretable.

Pathways Language Model (PaLM)

PaLM is from Google, and is also enormous (540 billion parameters). It excels at many language-related tasks, and became famous when it produced impressively clear explanations of tricky jokes. The most recent version is PaLM 2.

Claude

Anthropic’s Claude is billed as a “next-generation AI assistant.” The recent release of Claude 3.5 Sonnet “sets new industry benchmarks” in speed and intelligence, according to materials put out by the company. We haven’t looked at all the data ourselves, but we have played with the model and we know it’s very high-quality.

Command and Command R+

These are models created by Cohere, one of the major commercial platforms for current LLMs. They are comparable to most of the other big models, but Cohere has placed a special focus on enterprise applications, like agents, tools, and RAG.

What are the Best Ways of Overcoming the Limitations of Large Language Models?

Large language models are remarkable tools, but they nevertheless suffer from some well-known limitations. They tend to hallucinate facts, for example, sometimes fail at basic arithmetic, and can get lost in the course of lengthy conversations.

Overcoming the limitations of large language models is mostly a matter of either monitoring them and building scaffolding to enable RAG, or partnering with a conversational AI platform for CX that handles this tedium for you.

An additional wrinkle involves tradeoffs between different models. As we discuss below, sometimes models may outperform the competition on a task like code generation while being notably worse at a task like faithfully following instructions; in such cases, many opt to have an ensemble of models so they can pick and choose which to deploy in a given scenario. (It’s worth pointing out that even if you want to use one model for everything, you’ll absolutely need to swap in an upgraded version of that model eventually, so you still have the same model-management problem.)
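
To illustrate the ensemble idea, here’s a minimal sketch of task-based routing. Every name in it is a placeholder: call_model stands in for whichever provider SDKs you actually use, and the model names are illustrative rather than recommendations.

```python
# A minimal sketch of task-based model routing; all model names are placeholders.

TASK_TO_MODEL = {
    "code_generation": "model-a",        # whichever model benchmarks best on code for you
    "instruction_following": "model-b",  # whichever model follows instructions most faithfully
    "summarization": "model-c",
}

def call_model(model: str, prompt: str) -> str:
    """Hypothetical wrapper around whichever provider SDKs you actually use."""
    return f"[{model}] response to: {prompt}"

def route(task: str, prompt: str) -> str:
    # Fall back to a sensible default when the task isn't in the table.
    model = TASK_TO_MODEL.get(task, "model-b")
    return call_model(model, prompt)

print(route("code_generation", "Write a function that validates an email address."))
```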

This, too, is a place where a conversational AI platform for CX will make your life easier. The best such platforms are model-agnostic, meaning that they can use ChatGPT, Claude, Gemini, or whatever makes sense in a particular situation. This removes yet another headache, smoothing the way for you to use generative AI in your contact center with little fuss.

What are the Best Large Language Models?

Having read the foregoing, it’s natural to wonder if there’s a single model that best suits your enterprise. The answer is “it depends on the specifics of your use case.” You’ll have to think about whether you want an open-source model you control or you’re comfortable hitting an API, whether your use case is outside the scope of ChatGPT and better handled with a bespoke model, etc.

Speaking of use cases, in the next few sections, we’ll offer some advice on which current LLMs are best suited for which applications. However, this advice is based mostly on personal experience and other people’s reports of their experiences. This should be good enough to get you started, but bear in mind that these claims haven’t been borne out by rigorous testing and hard evidence—the field is too young for most of that to exist yet.

What’s the Best LLM if I’m on a Budget?

Pretty much any open-source model is given away for free, by definition. You can just Google “free open-source LLMs”, but one of the more frequently recommended open-source models is LLaMA 2 (there’s also the new LLaMA 3), both of which are free.

But many LLMs (both free and paid) also use the data you feed them for training purposes, which means you could be exposing proprietary or sensitive data if you’re not careful. Your best bet is to find a cost-effective platform that has an explicit promise not to use your data for training.

When you deal with an open-source model, you also have to pay for hosting, either on your own hardware or through a cloud service like Amazon Bedrock.
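
To give a sense of what self-hosting involves, here’s a minimal sketch that loads an open-source chat model with the Hugging Face transformers library. The model id is only an example (Meta’s Llama models require accepting a license on Hugging Face first), and running a model of this size comfortably usually means paying for GPU time, which is exactly the hosting cost mentioned above.

```python
# A minimal self-hosting sketch using Hugging Face transformers; the model id is
# illustrative and may require accepting a license before it can be downloaded.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

reply = generator(
    "Summarize our return policy for a customer in two sentences.",
    max_new_tokens=100,
)
print(reply[0]["generated_text"])
```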

What’s the Best LLM for a Large Context Window?

The context window is the amount of text an LLM can handle at a time. When ChatGPT was released, it had a context window of around 4,000 tokens. (A “token” isn’t exactly a word, but it’s close enough for our purposes.)

Generally (and up to a point), the longer the context window, the better the model is able to perform. Today’s models generally have context windows of at least a few tens of thousands of tokens, with some reaching into the low hundreds of thousands.
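
If you want a rough sense of how much of a context window your prompts actually consume, a tokenizer library makes it easy to check. Here’s a minimal sketch using OpenAI’s tiktoken package with one of its standard encodings; other vendors use different tokenizers, so treat the count as approximate for non-OpenAI models.

```python
# Rough token count for a prompt, using tiktoken's cl100k_base encoding
# (used by several OpenAI models; other vendors tokenize differently).
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

prompt = "Hi, I ordered two mugs last week. Can you tell me when they'll arrive?"
token_count = len(encoding.encode(prompt))

print(f"{token_count} tokens")  # compare this against the model's context window
```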

But, at a staggering 1 million tokens (roughly an hour-long video or the full text of a long novel), Google’s Gemini 1.5 simply towers over the others like Hagrid in the Shire.

That having been said, this space moves quickly, and context window length is an active area of research and development. These figures will likely be different next month, so be sure to check the latest information as you begin shopping for a model.

Choosing Among the Current Large Language Models

With all the different LLMs on offer, it’s hard to narrow the search down to the one that’s best for you. By carefully weighing the different metrics we’ve discussed in this article, you can choose an LLM that meets your needs with as little hassle as possible.

Pulling back a bit, let’s close by recalling that the whole purpose of choosing among current LLMs in the first place is to better meet the needs of our customers.

For this reason, you might want to consider working with a conversational AI platform for CX, like Quiq, that puts a plethora of LLMs at your fingertips through one simple interface.


Author

  • J.R. Rettenmeyer

    JR Rettenmyer is the Principal Applied AI Architect at Quiq. In his previous role at Snaps, an enterprise conversational AI company acquired by Quiq, JR held titles of both VP of Software Engineering and, later, SVP of Product Development. JR is a curious lifelong learner who leverages his background in product management, software engineering, and AI to develop strategies and solutions for our customers.

