It’s hard to imagine an application, website or workflow that wouldn’t benefit in some way from the new electricity that is generative AI. But what does it look like to integrate an LLM into an application? Is it just a matter of hitting a REST API with some basic auth credentials, or is there more to it than that?
In this article, we’ll enumerate the things you should consider when planning an LLM integration.
Why Integrate an LLM?
At first glance, it might not seem like LLMs make sense for your application—and maybe they don’t. After all, is the ability to write a compelling poem about a lost Highland Cow named Bo actually useful in your context? Or perhaps you’re not working on anything that remotely resembles a chatbot. Do LLMs still make sense?
The important thing to know about ‘Generative AI’ is that it’s not just about generating creative content like poems or chat responses. Generative AI (LLMs) can be used to solve a bevy of other problems that roughly fall into three categories:
- Making decisions (classification)
- Transforming data
- Extracting information
Let’s use the example of an inbound email from a customer to your business. How might we use LLMs to streamline that experience?
- Making Decisions
- Is this email relevant to the business?
- Is this email low, medium or high priority?
- Does this email contain inappropriate content?
- What person or department should this email be routed to?
- Transforming data
- Summarize the email for human handoff or record keeping
- Redact offensive language from the email subject and body
- Extracting information
- Extract information such as a phone number, business name, job title, etc. from the email body to be used by other systems (see the sketch after this list)
- Generating Responses
- Generate a personalized, contextually-aware auto-response informing the customer that help is on the way
- Alternatively, deploy a more sophisticated LLM flow (likely involving RAG) to directly address the customer’s need
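For a concrete sense of what one of these looks like in practice, here's a minimal sketch of the extraction task using the OpenAI Python SDK. The model name, prompt wording and JSON keys are illustrative choices on our part, not requirements:

```python
# Minimal sketch: extract structured fields from a free-form customer email.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_contact_info(email_body: str) -> dict:
    """Ask the LLM to pull structured fields out of a free-form email."""
    prompt = (
        "Extract the sender's phone number, business name, and job title from the "
        "email below. Respond with JSON using the keys phone, business, and title; "
        "use null for anything that isn't present.\n\n"
        f"Email:\n{email_body}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",                      # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ask for machine-readable output
    )
    return json.loads(response.choices[0].message.content)

print(extract_contact_info("Hi, this is Jane Doe, COO of Acme Corp. Call me at 555-0123."))
```

The same pattern (a focused prompt plus a little glue code) covers most of the decision and transformation tasks listed above.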
It’s easy to see how solving these tasks would increase user satisfaction while also improving operational efficiency. All of these use cases are utilizing ‘Generative AI’, but some feel more generative than others.
When we consider decision making, data transformation and information extraction in addition to the more stereotypical generative AI use cases, it becomes harder to imagine a system that wouldn’t benefit from an LLM integration. Why? Because nearly all systems have some amount of human-generated ‘natural’ data (like text) that is no longer opaque in the age of LLMs.
Prior to LLMs, it was possible to solve most of the tasks listed above, but it was far harder. Let's consider 'is this email relevant to the business'. What would it have taken to solve this before LLMs?
- A dataset of example emails labeled true if they’re relevant to the business and false if not (the bigger the better)
- A training pipeline to produce a custom machine learning model for this task
- Specialized hardware or cloud resources for training & inference
- Data scientists, data curators, and Ops people to make it all happen
LLMs can solve many of these problems with radically lower effort and complexity, and they will often do a better job. With traditional machine learning models, your model is, at best, as good as the data you give it. With generative AI you can coach and refine the LLM’s behavior until it matches what you desire – regardless of historical data.
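To make the contrast concrete, here's roughly what that relevance check looks like as a single prompt. We're again assuming the OpenAI Python SDK; the retailer persona and model name are made up for illustration, and the 'coaching' described above amounts to editing the instruction text rather than relabeling a dataset:

```python
# Rough sketch: 'is this email relevant to the business?' as one prompt.
from openai import OpenAI

client = OpenAI()

def is_relevant(email_body: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": (
                "You triage inbound email for an outdoor-gear retailer. "  # the 'coaching' lives here
                "Answer with exactly one word, YES or NO: is the following "
                "email relevant to the business?\n\n" + email_body
            ),
        }],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```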
For these reasons LLMs are being deployed everywhere—and consumers’ expectations continue to rise.
How Do You Feel About LLM Vendor Lock-In?
Once you’ve decided to pursue an LLM integration, the first issue to consider is whether you’re comfortable with vendor lock-in. The LLM market is moving at lightspeed with the constant release of new models featuring new capabilities like function calls, multimodal prompting, and of course increased intelligence at higher speeds. Simultaneously, costs are plummeting. For this reason, it’s likely that your preferred LLM vendor today may not be your preferred vendor tomorrow.
Even at a fixed point in time, you may need more than a single LLM vendor.
In our recent experience, there are certain classification problems that Anthropic’s Claude does a better job of handling than comparable models from OpenAI. Similarly, we often prefer OpenAI models for truly generative tasks like generating responses. All of these LLM tasks might be in support of the same integration, so you may want to look at the project not so much as integrating a single LLM or vendor, but rather as assembling a suite of tools.
If your use case is simple and low volume, a single vendor is probably fine. But if you plan to do anything moderately complex or high scale you should plan on integrating multiple LLM vendors to have access to the right models at the best price.
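One practical way to keep your options open is to put a thin interface in front of your LLM calls so that swapping vendors, or mixing them per task, is a one-line change. The wrapper classes below are our own illustration rather than a standard API; the OpenAI and Anthropic SDK calls are real, but the model names are just examples:

```python
# Sketch of a thin vendor-agnostic wrapper around two LLM providers.
from typing import Protocol
from openai import OpenAI
import anthropic

class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...

class OpenAIChat:
    def __init__(self, model: str = "gpt-4o-mini"):
        self.client, self.model = OpenAI(), model

    def complete(self, prompt: str) -> str:
        resp = self.client.chat.completions.create(
            model=self.model, messages=[{"role": "user", "content": prompt}]
        )
        return resp.choices[0].message.content

class AnthropicChat:
    def __init__(self, model: str = "claude-3-5-sonnet-latest"):
        self.client, self.model = anthropic.Anthropic(), model

    def complete(self, prompt: str) -> str:
        resp = self.client.messages.create(
            model=self.model, max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text

# Route each task to whichever vendor handles it best.
classifier: ChatModel = AnthropicChat()
responder: ChatModel = OpenAIChat()
```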
Resiliency & Scalability are Earned—Not Given
Making API calls to an LLM is trivial. Ensuring that your LLM integration is resilient and scalable requires more elbow grease. In fact, LLM API integrations pose unique challenges:
| Challenge | Solutions |
| --- | --- |
| They are pretty slow | If your application is high-scale and you’re doing synchronous (threaded) network calls, your application won’t scale well since most threads will be blocked on LLM calls. Consider switching to async I/O. You’ll also want to support running multiple prompts in parallel to reduce the latency visible to the user. |
| They are throttled by requests per minute and tokens per minute | Estimate your LLM usage in terms of requests and tokens per minute, and work with your provider(s) to ensure sufficient capacity for peak load. |
| They are (still) kinda flaky (unpredictable response times, unresponsive connections) | Employ retry schemes in response to timeouts, 500s, 429s (rate limiting), etc. |
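Here's a rough sketch of the first and third rows in code, using the OpenAI async client. The retry policy (three attempts with exponential backoff) and the 30-second timeout are illustrative starting points, not recommendations:

```python
# Sketch: async I/O, parallel prompts, and retries on timeouts / 5xx / 429s.
import asyncio
from openai import AsyncOpenAI, APIStatusError, APITimeoutError

client = AsyncOpenAI()

async def call_llm(prompt: str, attempts: int = 3) -> str:
    """Call the LLM, retrying on timeouts and HTTP errors (including 429s and 500s)."""
    for attempt in range(attempts):
        try:
            response = await client.chat.completions.create(
                model="gpt-4o-mini",  # illustrative model choice
                messages=[{"role": "user", "content": prompt}],
                timeout=30,
            )
            return response.choices[0].message.content
        except (APITimeoutError, APIStatusError):
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(2 ** attempt)  # simple exponential backoff

async def triage(email_body: str) -> list[str]:
    # Independent prompts can run in parallel rather than back to back.
    return await asyncio.gather(
        call_llm(f"Classify the priority of this email as low, medium or high:\n{email_body}"),
        call_llm(f"Summarize this email in two sentences:\n{email_body}"),
    )
```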
The above remediations will help your application stay scalable and resilient while your LLM service is up. But what if it’s down? If your LLM integration is on a critical execution path, you’ll want to support automatic failover. Some LLMs are available from multiple providers:
- OpenAI models are hosted by OpenAI itself as well as Azure
- Anthropic models are hosted by Anthropic itself as well as AWS
Even if an LLM has only a single provider (and even if it has several), you can also provision the same logical LLM in multiple cloud regions to create a failover resource. Typically you’ll want the provider failover built into your retry scheme. Our failover mechanisms get tripped regularly in production at Quiq, no doubt partly because of how rapidly the AI world is moving.
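A failover layer can be as simple as trying a list of providers in order. The sketch below reuses the hypothetical ChatModel wrappers from earlier; in a real system each provider would also do its own retries before you move on:

```python
# Sketch: provider failover layered on top of per-provider retries.
def complete_with_failover(prompt: str, providers: list[ChatModel]) -> str:
    last_error: Exception | None = None
    for provider in providers:
        try:
            return provider.complete(prompt)   # each provider handles its own retries
        except Exception as err:               # if one vendor/region is down, move on
            last_error = err
    raise RuntimeError("All LLM providers failed") from last_error

# e.g. prefer OpenAI directly, fall back to an Anthropic-hosted model:
answer = complete_with_failover("Summarize this email: ...", [OpenAIChat(), AnthropicChat()])
```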
Are You Actually Building an Agentic Workflow?
Oftentimes you have a task that you know is well-suited for an LLM. For example, let’s say you’re planning to use an LLM to analyze the sentiment of product reviews. On the surface, this seems like a simple task that will require one LLM call that passes in the product review and asks the LLM to decide the sentiment. Will a single prompt suffice? What if we also want to determine if a given review contains profanity or personal information? What if we want to ask three LLMs and average their results?
Many tasks require multiple prompts, prompt chaining and possibly RAG (Retrieval Augmented Generation) to best solve a problem. Just like humans, AI produces better results when a problem is broken down into pieces. Such solutions are variously known as AI Agents, Agentic Workflows or Agent Networks and are why open source tools like LangChain were originally developed.
In our experience, pretty much every prompt eventually grows up to be an Agentic Workflow, which has interesting implications for how it’s configured & monitored.
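For instance, the 'simple' sentiment task from above might quickly end up looking something like this, with several narrow prompts running in parallel and their results combined in code. call_llm is the hypothetical retrying helper sketched earlier, and the prompts are illustrative:

```python
# Sketch: one 'simple' task fans out into several narrow prompts.
import asyncio

async def analyze_review(review: str) -> dict:
    sentiment, profanity, pii = await asyncio.gather(
        call_llm(f"Classify the sentiment of this review as positive, neutral or negative:\n{review}"),
        call_llm(f"Answer YES or NO: does this review contain profanity?\n{review}"),
        call_llm(f"Answer YES or NO: does this review contain personal information?\n{review}"),
    )
    return {
        "sentiment": sentiment.strip().lower(),
        "contains_profanity": profanity.strip().upper().startswith("YES"),
        "contains_pii": pii.strip().upper().startswith("YES"),
    }
```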
Be Ready for the Snowball Effect
Introducing LLMs can result in a technological snowball effect, particularly if you need to use Retrieval Augmented Generation (RAG). LLMs are trained on mostly public data that was available at a fixed point in the past. If you want an LLM to behave in light of up-to-date and/or proprietary data sources (which most non-trivial applications do) you’ll need to do RAG.
RAG refers to retrieving the up-to-date and/or proprietary data you want the LLM to use in its decision making and passing it to the LLM as part of your prompt.
Assuming you need to search a reference dataset like a knowledge base, product catalog or product manual, the retrieval part of RAG typically entails adding the following entities to your system:
1. An embedding model
An embedding model is roughly half of an LLM – it does a great job of reading and understanding the information you pass it, but instead of generating a completion, it produces a numeric vector that encodes its understanding of the source material.
You’ll typically run the embedding model on all of the business data you want to search and retrieve for the LLM. Most LLM providers also offer embedding models, or you can hit one via any major cloud.
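As a sketch, embedding a couple of knowledge base snippets with the OpenAI embeddings endpoint might look like this (the model name and documents are illustrative):

```python
# Sketch: turn business documents into numeric vectors for later retrieval.
from openai import OpenAI

client = OpenAI()

documents = [
    "Returns are accepted within 30 days with a receipt.",
    "Our support hours are 8am-6pm Mountain Time, Monday through Friday.",
]

response = client.embeddings.create(model="text-embedding-3-small", input=documents)
vectors = [item.embedding for item in response.data]   # one vector per document
```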
2. A vector database
Once you have embeddings for all of your business data, you need to store them somewhere that supports fast search over numeric vectors. Solutions like Pinecone and Milvus fill this need, but that means integrating a new vendor or hosting a new database internally.
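To show the shape of the retrieval step, here's a brute-force nearest-neighbor search standing in for a real vector database. The client, documents and vectors variables are from the embedding sketch above, and the prompt template is our own:

```python
# Sketch: retrieve the most relevant documents and fold them into the prompt.
import numpy as np

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the question."""
    query = client.embeddings.create(
        model="text-embedding-3-small", input=[question]
    ).data[0].embedding
    scores = [
        float(np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec)))
        for vec in vectors
    ]
    ranked = sorted(zip(scores, documents), reverse=True)
    return [doc for _, doc in ranked[:k]]

context = "\n".join(retrieve("What are your support hours?"))
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    "Question: What are your support hours?"
)
```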
After implementing embeddings and a vector search solution, you can now retrieve information to include in the prompts you send to your LLM(s). But how can you trust that the LLM’s response is grounded in the information you provided and not something based on stale information or purely made up?
There are specialized deep learning models that exist solely for the purpose of ensuring that an LLM’s generative claims are grounded in facts you provide. This practice is variously referred to as hallucination detection, claim verification, NLI (natural language inference), etc. We believe NLI models are an essential part of a trustworthy RAG pipeline, but managed cloud solutions are scarce and you may need to host one yourself on GPU-enabled hardware.
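As a sketch of what self-hosting such a check might look like, here's an entailment test using an open MNLI-style model from Hugging Face; the specific model and threshold are just one reasonable choice:

```python
# Sketch: check whether the retrieved context actually entails the LLM's claim.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nli_model_name = "microsoft/deberta-large-mnli"   # illustrative model choice
tokenizer = AutoTokenizer.from_pretrained(nli_model_name)
nli_model = AutoModelForSequenceClassification.from_pretrained(nli_model_name)

def is_grounded(context: str, claim: str, threshold: float = 0.8) -> bool:
    """Return True if the retrieved context entails the LLM's claim."""
    inputs = tokenizer(context, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(nli_model(**inputs).logits, dim=-1)[0]
    labels = {name.lower(): idx for idx, name in nli_model.config.id2label.items()}
    return probs[labels["entailment"]].item() >= threshold

print(is_grounded(
    "Our support hours are 8am-6pm Mountain Time, Monday through Friday.",
    "Support is available at 9pm on Saturdays.",
))  # -> False
```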
Is a Black Box Sustainable?
If you bake your LLM integration directly into your app, you will effectively end up with a black box that can only be understood and improved by engineers. This could make sense if you have a decent-sized software shop and they’re the only folks likely to monitor or maintain the integration.
However, your best software engineers may not be your best (or most willing) prompt engineers, and you may wish to involve other personas like product and experience designers since an LLM’s output is often part of your application’s presentation layer & brand.
For these reasons, prompts will quickly need to move from code to configuration – no big deal. However, as an LLM integration matures it will likely become an Agentic Workflow involving:
- More prompts, prompt parallelization & chaining
- More prompt engineering
- RAG and other orchestration
Moving these concerns into configuration is significantly more complex but necessary on larger projects. In addition, people will inevitably want to observe and understand the behavior of the integration to some degree.
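As a small illustration of the code-to-configuration move, the sketch below loads a prompt chain from a config blob rather than hard-coding it. The schema is entirely our own invention, and call_llm is the hypothetical retrying helper from earlier:

```python
# Sketch: prompts and chain structure live in config that non-engineers can edit.
import json

# In practice this would live in a file or a database, not a string literal.
WORKFLOW_CONFIG = json.loads("""
{
  "steps": [
    {"name": "sentiment", "prompt": "Classify the sentiment of this review as positive, neutral or negative:\\n{review}"},
    {"name": "profanity", "prompt": "Answer YES or NO: does this review contain profanity?\\n{review}"}
  ]
}
""")

async def run_workflow(review: str) -> dict:
    results = {}
    for step in WORKFLOW_CONFIG["steps"]:
        results[step["name"]] = await call_llm(step["prompt"].format(review=review))
    return results
```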
For this reason it might make sense to embrace a visual framework for developing Agentic Workflows from the get-go. By doing so you open up the project to collaboration from non-engineers while promoting observability into the integration. If you don’t go this route, be prepared to continually build out configurability and observability tools on the side.
Quiq’s AI Automations Take Care of LLM Integration Headaches For You
Hopefully we’ve given you a sense of what it takes to build an enterprise LLM integration. Now it’s time for the plug. The considerations outlined above are exactly why we built AI Studio and particularly our AI Automations product.
With AI Automations you can create a serverless API that handles all the complexities of a fully orchestrated AI flow, including support for multiple LLMs, chaining, RAG, resiliency, observability and more. With AI Automations, your LLM integration can go back to being ‘just an API call with basic auth’.
Want to learn more? Dive into AI Studio or reach out to our team.