How to Get the Most out of Your NLP Models with Preprocessing

Along with computer vision, natural language processing (NLP) is one of the great triumphs of modern machine learning. While ChatGPT is all the rage and large language models (LLMs) are drawing everyone’s attention, that doesn’t mean that the rest of the NLP field just goes away.

NLP endeavors to apply computation to human-generated language, whether that be the spoken word or text existing in places like Wikipedia. There are any number of ways in which this would be relevant to customer experience and service leaders, including:

Using it to power customer-facing chatbots
Creating question-answering systems
Classifying sentiment from e.g. customer reviews
Automatically transcribing client calls

Today, we’re going to briefly touch on what NLP is, but we’ll spend the bulk of our time discussing how textual training data can be preprocessed to get the most out of an NLP system. There are a few branches of NLP, like speech recognition and text-to-speech, which we’ll be omitting.

Armed with this context, you’ll be better prepared to evaluate using NLP in your business (though if you’re building customer-facing chatbots, you can also let the Quiq platform do the heavy lifting for you).

What is Natural Language Processing?

In the past, we’ve jokingly referred to NLP as “doing computer stuff with words after you’ve tricked them into being math.” This is meant to be humorous, but it does capture the basic essence.

Remember, your computer doesn’t know what words are; all it does is move 1’s and 0’s around. A crucial step in most NLP applications, therefore, is creating a numerical representation out of the words in your training corpus.

There are many ways of doing this, but today a popular method is using word vector embeddings. Also known simply as “embeddings”, these are vectors of real numbers. They come from a trained model such as word2vec or GloVe and stand in for particular words.

The technical details of this process don’t concern us in this post. What’s important is that you end up with vectors that capture a remarkable amount of semantic information. Words with similar meanings also have similar vectors, for example, so you can do things like find synonyms for a word by finding vectors that are mathematically close to it.
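To make “mathematically close” a little more concrete, here is a minimal sketch that compares vectors with cosine similarity, one common closeness measure. The three-dimensional vectors and the helper names are invented for illustration; real embeddings come from a trained model and usually have hundreds of dimensions.

```python
import math

# Toy 3-dimensional vectors invented for illustration; real embeddings
# typically have hundreds of dimensions and come from a trained model.
EMBEDDINGS = {
    "happy": [0.9, 0.8, 0.1],
    "glad": [0.85, 0.75, 0.2],
    "table": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_word(word, embeddings):
    """Return the other word whose vector is most similar to `word`'s."""
    target = embeddings[word]
    candidates = (w for w in embeddings if w != word)
    return max(candidates, key=lambda w: cosine_similarity(target, embeddings[w]))


print(nearest_word("happy", EMBEDDINGS))
```

With these made-up numbers, “glad” sits much closer to “happy” than “table” does, which is the same intuition behind synonym lookup in a real embedding space.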

These embeddings are the basic data structures used across most of NLP. They power sentiment analysis, topic modeling, and many other applications.

For most projects it’s enough to use pre-existing word vector embeddings without going through the trouble of generating them yourself.

Are Large Language Models Natural Language Processing?

Large language models (LLMs) are a subset of natural language processing. Training an LLM draws on many of the same techniques and best practices as the rest of NLP, but NLP also addresses a wide variety of other language-based tasks.

Conversational AI is a great case in point. One way of building a conversational agent is by hooking your application up to an LLM like ChatGPT, but you can also do it with a rules-based approach, through grounded learning, or with an ensemble that weaves together several methods.

Data Preprocessing for NLP

If you’ve ever sent a well-meaning text that was misinterpreted, you know that language is messy. For this reason, NLP places special demands on the data engineers and data scientists who must transform text in various ways before machine learning algorithms can be trained on it.

In the next few sections, we’ll offer a fairly comprehensive overview of data preprocessing for NLP. This will not cover everything you might encounter in the course of preparing data for your NLP application, but it should be more than enough to get started.

Why is Data Preprocessing Important?

They say that data is the new oil, and just as you can’t put oil directly in your gas tank and expect your car to run, you can’t plow a bunch of garbled, poorly-formatted language data into your algorithms and expect magic to come out the other side.

But what, precisely, counts as preprocessing will depend on your goals. You might choose to omit or include emojis, for example, depending on whether you’re training a model to summarize academic papers or write tweets for you.

That having been said, there are certain steps you can almost always expect to take, including standardizing the case of your language data, removing punctuation, white spaces and stop words, segmenting and tokenizing, etc.

We treat each of these common techniques below.

Segmentation and Tokenization

An NLP model is always trained on some consistent chunk of the full data. When ChatGPT was trained, for example, they didn’t put the entire internet in a big truck and back it up to a server farm; they used self-supervised learning.

Simplifying greatly, this means that the underlying algorithm would take, say, the first three sentences of a paragraph and then try to predict the remaining sentence on the basis of the text that came before. Over time it sees enough language to guess that “to be or not to be, that is ___ ________” ends with “the question.”

But how was ChatGPT shown the first three sentences? How does that process even work?

A big part of the answer is segmentation and tokenization.

With segmentation, we’re breaking a full corpus of training text – which might contain hundreds of books and millions of words – down into units like words or sentences.

This is far from trivial. In English, sentences typically end with a period, but abbreviations like “Mr.” and “etc.” also contain them. It can be a real challenge to divide text into sentences without also breaking “Mr. Smith is cooking the steak.” into “Mr.” and “Smith is cooking the steak.”
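As a rough illustration of the “Mr. Smith” problem, here is a naive sentence splitter that refuses to break after a handful of known abbreviations. The abbreviation list and the function name are our own assumptions for this sketch; production segmenters rely on much richer rules or trained models.

```python
# A small, illustrative abbreviation list; a real system would need many more.
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr.", "etc.", "e.g.", "i.e."}


def split_sentences(text):
    """Split text into sentences, skipping periods that end known abbreviations."""
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        ends_sentence = token.endswith((".", "!", "?"))
        if ends_sentence and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text with no closing punctuation
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Mr. Smith is cooking the steak. It smells great!"))
```

Even this sketch has blind spots: a sentence that genuinely ends in “etc.” will not be split, which hints at why segmentation is harder than it looks.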

Tokenization is a related process of breaking a corpus down into tokens. Tokens are sometimes described as words, but in truth they can be words, short clusters of a few words, sub-words, or even individual characters.

This matters a lot to the training of your NLP model. You could train a generative language model to predict the next sentence based on the preceding sentences, the next word based on the preceding words, or the next character based on the preceding characters.

Regardless, in both segmentation and tokenization, you’re decomposing a whole bunch of text down into individual units that your algorithm can work with.
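Here is a small sketch of what different token granularities look like, along with the kind of (context, next token) pairs a generative model could be trained on. The function names and the context size of three are assumptions chosen for illustration, and modern LLMs generally use learned sub-word tokenizers rather than a regular expression like this one.

```python
import re


def word_tokens(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def char_tokens(text):
    """Treat every character, including spaces, as its own token."""
    return list(text)


def next_token_pairs(tokens, context_size=3):
    """Pair each window of `context_size` tokens with the token that follows it."""
    return [
        (tokens[i : i + context_size], tokens[i + context_size])
        for i in range(len(tokens) - context_size)
    ]


tokens = word_tokens("To be, or not to be.")
for context, target in next_token_pairs(tokens):
    print(context, "->", target)
```

Swapping `word_tokens` for `char_tokens` yields the same kind of training pairs at the character level, which is the trade-off described above.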

Making the Case Consistent

It’s standard practice to make the case of your text consistent throughout, as this makes training simpler. This is usually done by lowercasing all the text, though we suppose if you’re feeling rebellious there’s no reason you couldn’t uppercase it (but the NLP engineers might not invite you to their fun Natural Language Parties if you do).
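A quick sketch of why this helps: without normalization, every casing of a word counts as a separate vocabulary entry. The sample tokens below are made up for illustration.

```python
from collections import Counter

# Made-up sample tokens for illustration.
raw_tokens = ["Steak", "steak", "STEAK", "Smith"]

# Without normalization, each casing is a separate vocabulary entry.
raw_counts = Counter(raw_tokens)

# str.lower() folds them together; str.casefold() is a more aggressive
# alternative that can be useful for some non-English text.
normalized_counts = Counter(token.lower() for token in raw_tokens)

print(len(raw_counts), "distinct tokens before lowercasing")
print(len(normalized_counts), "distinct tokens after lowercasing")
```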

Fixing Misspellings

NLP, like machine learning more generally, is only as good as its data. If you feed it text with a lot of errors in spelling, it will learn those errors and they’ll show up again later.

This probably isn’t something you’ll want to do manually, and if you’re using a popular language there’s likely a module you can use to do this for you. Python, for example, has TextBlob, Autocorrect, and Pyspellchecker libraries that can handle spelling errors.
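To show the general idea without depending on any of those libraries, here is a minimal stand-in built on `difflib` from Python’s standard library. The vocabulary, the 0.8 similarity cutoff, and the function names are assumptions for this sketch; a dedicated spell checker will be far more thorough.

```python
import difflib

# Hypothetical vocabulary of known-good words; in practice this would be a
# full dictionary or the word list of a dedicated spell-checking library.
VOCABULARY = ["customer", "service", "review", "chatbot", "language"]


def correct_word(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the word itself if none is close."""
    if word in vocabulary:
        return word
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text):
    """Lowercase the text and correct it word by word."""
    return " ".join(correct_word(w) for w in text.lower().split())


print(correct_text("The custmer left a review"))
```

Words that are not close to anything in the vocabulary are left untouched, which keeps the sketch from “correcting” words it simply doesn’t know.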

Getting Rid of the Punctuation Marks

Natural language tends to have a lot of punctuation, with English utilizing more than a dozen marks such as ‘!’ and ‘;’ for emphasis and clarification. These are usually removed as part of preprocessing.

This task is something that can be handled with regular expressions (if you have the patience for it…), or you can do it with an NLP library like Natural Language Toolkit (NLTK).
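Here is a small sketch of two standard-library ways to do it, one with `str.translate` and one with a regular expression. The function names are ours, and the two approaches are similar rather than identical: the regex version also drops non-ASCII punctuation such as curly quotes, but keeps underscores.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text):
    """Remove ASCII punctuation using str.translate."""
    return text.translate(_PUNCT_TABLE)


def strip_punctuation_regex(text):
    """Drop every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)


print(strip_punctuation("Wait! Really; yes."))
print(strip_punctuation_regex("Wait! Really; yes."))
```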

Expanding the Contractions

Contractions are shortened combinations of words, like turning “do not” into “don’t” or “would not” into “wouldn’t”. These, too, can be problematic for NLP algorithms and are usually expanded back into their full forms during preprocessing.
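One simple way to approach this is a lookup table plus a regular expression, sketched below. The four-entry mapping and the function name are assumptions for illustration; a real mapping would be much longer, and some contractions (“he’s” can mean “he is” or “he has”) can’t be resolved without context.

```python
import re

# A few illustrative entries; a real mapping would be much longer.
CONTRACTIONS = {
    "don't": "do not",
    "wouldn't": "would not",
    "can't": "cannot",
    "we're": "we are",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    # Normalize curly apostrophes so they match the straight ones in the mapping.
    text = text.replace("\u2019", "'")
    # Note: the replacement is always lowercase, so "Don't" becomes "do not".
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


print(expand_contractions("I don't think it wouldn't work"))
```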

Stemming

In linguistics, the stem of a word is its root. The words “runs”, “ran”, and “running” all have the word “run” as their base.

Stemming is one of two approaches for reducing the myriad inflected forms of a word down into a single basic representation. The other is lemmatization, which we’ll discuss in the next section.

Stemming is the cruder of the two, and is usually done with an algorithm known as the Porter stemmer. This stemmer doesn’t always produce the stem you’d expect. “Cats” becomes “cat” while “ponies” becomes “poni”, for example. Nevertheless, this is probably sufficient for basic NLP tasks.
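To see where “poni” comes from, here is a sketch of just the first plural-handling step (step 1a) of the Porter algorithm. This is not the full stemmer, which has several more steps; the function name is ours, and for real work you would reach for a complete implementation such as NLTK’s `PorterStemmer`.

```python
def porter_step_1a(word):
    """Apply only step 1a (plural handling) of the Porter stemming algorithm."""
    if word.endswith("sses"):
        return word[:-2]  # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # ponies -> poni
    if word.endswith("ss"):
        return word  # caress -> caress
    if word.endswith("s"):
        return word[:-1]  # cats -> cat
    return word


for example in ["cats", "ponies", "caresses", "caress"]:
    print(example, "->", porter_step_1a(example))
```

The rules are purely mechanical suffix rewrites, which is why stemmers are fast but sometimes produce strings that aren’t real words.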

Lemmatization

A more sophisticated version of stemming is lemmatization. A stemmer wouldn’t know the difference between the word “left” in “cookies are ahead and to the left” and “he left the book on the table”, whereas a lemmatizer would.

More generally, a lemmatizer uses language-specific context to handle very subtle distinctions between words, and this means it will usually take longer to run than a stemmer.

Whether it makes sense to use a stemmer or a lemmatizer will depend on the use case you’re interested in. Under most circumstances, lemmatizers are more accurate, and stemmers are faster.
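The sketch below is a toy illustration of why context matters: the same surface word maps to different lemmas depending on its part of speech. The mini-dictionary, the tag names, and the function are all hypothetical; real lemmatizers draw on large lexical resources (NLTK’s `WordNetLemmatizer`, for instance, also accepts a part-of-speech argument).

```python
# Hypothetical mini-dictionary keyed by (word, part of speech). Real
# lemmatizers rely on large lexical resources rather than a hand-made table.
LEMMAS = {
    ("left", "VERB"): "leave",
    ("left", "NOUN"): "left",
    ("running", "VERB"): "run",
    ("ran", "VERB"): "run",
    ("ponies", "NOUN"): "pony",
}


def lemmatize(word, pos):
    """Look up the lemma for a word given its part of speech."""
    key = (word.lower(), pos)
    # Fall back to the lowercased word when the pair is unknown.
    return LEMMAS.get(key, word.lower())


print(lemmatize("left", "VERB"))  # as in "he left the book on the table"
print(lemmatize("left", "NOUN"))  # as in "to the left"
```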

Removing Extra White Spaces

It’ll often be the case that a corpus will have an inconsistent set of spacing conventions. This, too, is something the algorithm will learn unless it’s remedied during preprocessing.
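A minimal sketch of one common fix, collapsing every run of whitespace into a single space (the function name is ours):

```python
import re


def normalize_whitespace(text):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


print(normalize_whitespace("  Mr.  Smith\tis \n cooking  "))
```

If line breaks carry meaning in your data (for example, as paragraph boundaries), you’d want to handle them separately before applying something like this.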

Removing Stopwords

This is a big one. “Stopwords” are words like “the” or “is”, and they’re almost always removed before training begins because they don’t add much in the way of useful information.

Because this is done so commonly, you can assume that the NLP library you’re using will have some easy way of doing it. NLTK, for example, has a native list of stopwords that can simply be imported:

from nltk.corpus import stopwords

With this, you can simply exclude the stopwords from the corpus.
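The filtering itself can be sketched in a couple of lines. To keep this example self-contained we use a tiny hand-written stopword set, which is our own assumption; with NLTK you could instead build the set from `stopwords.words("english")` once the stopwords corpus has been downloaded.

```python
# A tiny illustrative stopword set; NLTK's English list is considerably longer.
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not stopwords (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stopwords]


print(remove_stopwords("The steak is in the pan".split()))
```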

Ditching the Digits

If you’re building an NLP application that processes data containing numbers, you’ll probably want to remove them, as the training algorithm might otherwise end up inserting random digits here and there.

This, alas, is something that will probably need to be done with regular expressions.
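Fortunately, the regular expression involved is a short one. A minimal sketch, with a function name of our own choosing:

```python
import re


def remove_digits(text):
    """Delete digit runs, then tidy up the extra spaces they leave behind."""
    without_digits = re.sub(r"\d+", "", text)
    return re.sub(r"\s{2,}", " ", without_digits).strip()


print(remove_digits("Order 66 shipped in 3 days"))
```

Depending on your goals, you might prefer to replace numbers with a placeholder token instead of deleting them outright, so the model still knows a number was there.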

Part of Speech Tagging

Part of speech tagging refers to the process of automatically tagging a word with extra grammatical information about whether it’s a noun, verb, etc.

This is certainly not something that you always have to do (we’ve completed a number of NLP projects where it never came up), but it’s still worth understanding what it is.
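To give a feel for what a tagger’s output looks like, here is a deliberately simplistic sketch that combines a small lexicon with suffix heuristics. The lexicon, the rules, and the tag names are invented for illustration; real taggers, such as the one behind NLTK’s `pos_tag`, are trained on annotated text and are far more accurate.

```python
# Hypothetical lexicon and suffix rules, invented for illustration only.
LEXICON = {"the": "DET", "a": "DET", "is": "VERB", "smith": "NOUN", "steak": "NOUN"}
SUFFIX_RULES = [("ing", "VERB"), ("ly", "ADV"), ("ed", "VERB")]


def tag(tokens):
    """Return (token, part-of-speech) pairs using the lexicon, then suffix rules."""
    tagged = []
    for token in tokens:
        word = token.lower()
        pos = LEXICON.get(word)
        if pos is None:
            # Fall back to suffix heuristics, defaulting to NOUN.
            pos = next(
                (p for suffix, p in SUFFIX_RULES if word.endswith(suffix)),
                "NOUN",
            )
        tagged.append((token, pos))
    return tagged


print(tag(["Smith", "is", "cooking", "the", "steak"]))
```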

Supercharging Your NLP Applications

Natural language processing is an enormously powerful constellation of techniques that allow computers to do worthwhile work on text data. It can be used to build question-answering systems, tutors, chatbots, and much more.

But to get the most out of it, you’ll need to preprocess the data. No matter how much computing power you have access to, machine learning isn’t of much use with bad data. Techniques like removing stopwords, expanding contractions, and lemmatization create corpora of text that can then be fed to NLP algorithms.

Of course, there’s always an easier way. If you’d rather skip straight to the part where cutting-edge conversational AI directly adds value to your business, you can also reach out to see what the Quiq platform can do.