What is NLP?
If you’ve ever taken an introductory programming class, there’s a good chance the instructor said: “Computers are stupid.” They only do what they’re told, and only within the confines of what can be represented in logic. The intent is to remind students that computers aren’t intelligent thinking machines — like Star Trek’s Data — but rather tools that must be instructed in a very particular way.
That axiom remains true, although it glosses over the fact that computers — through the power of sophisticated AI models — have recently become very capable at understanding the meaning and intent of spoken or written language. Natural Language Processing (NLP) describes a broad field of study that focuses precisely on that.
You probably interact with NLP technology in your daily life — whether asking Alexa or Siri to play your favorite music, or using ChatGPT to explain a complicated subject in layperson’s terms. These tools make our lives more convenient, and they’re only getting better as the underlying technology improves.
This article will introduce you to the underlying principles behind Natural Language Processing, how they’re used, and their capabilities and limitations. We’ll then show you some of the incredible advancements in NLP technology that have happened in the past few years, and explain what this means for future products and applications.
What is Natural Language Processing?
The term NLP describes a field of study that encompasses several disciplines — including computer science, linguistics, and AI — and focuses on enabling computers to understand, interpret, and generate natural human language in both its spoken and written forms.
We see the benefits of NLP in our daily lives, but we shouldn’t take them for granted. Human language is messy and complicated. It’s not merely diverse — with countless different languages, dialects, and accents — but also constantly evolving. Oh, and it's filled with irregularities, such as idioms with both figurative and literal meanings.
“Dave said mean things about Lucy behind her back.” Should that be interpreted figuratively (as in, without Lucy knowing) or literally (Dave stood behind Lucy when he said mean things)? As a human, you can recognise the meaning here. For computers, it’s a lot harder.
Idioms — which one NLP researcher, Stanford computational linguist Ivan Sag, described as a “pain in the neck” — are just one quirk of human language. But they aren’t the only obstacle that needed to be overcome before we reached the point where NLP could be a useful force in our daily lives.
The earliest work in NLP began at the dawn of the computer age — mere years after the first electronic stored-program computer, the Manchester Baby, debuted — and was theoretical in nature. In a 1950 paper for the journal Mind, the legendary computer scientist and codebreaker Alan Turing proposed The Imitation Game — a method of determining whether a computer can behave in a way indistinguishable from human behaviour.
The Imitation Game — today known as The Turing Test — involves three participants: a human player, a computer player, and a human evaluator. The evaluator interrogates the players through a text-based interface in an attempt to identify which player is the computer. If the evaluator is unable to do so, then the computer is said to have “passed” — or, more aptly, exhibited intelligent behaviour.
Although the Turing Test was hypothetical (no machine at the time could possibly beat it), it was nonetheless significant because it highlighted the “separateness” of language as a problem. Harry Collins, Professor of Social Science, described it as a “test of linguistic fluency.” Whereas some of the earliest speculative research in AI focused on practical applications — like playing chess or military uses — Turing’s work centred on human-computer interactions.
The Evolution of Natural Language Processing
The 1950s were a time of tremendous development in the field of computer science, with computers moving from the purely academic and government domain to slowly find a home in the business world. In 1951, Remington Rand released the UNIVAC I — the world’s first commercially available digital computer. Other vendors, notably IBM, soon followed.
These businesses needed to demonstrate a commercial need for computing. They needed to impress those with the deepest pockets and the highest levels of influence. And so, IBM partnered with Georgetown University to create the world’s first machine translation system, programming an IBM 701 computer to translate rudimentary phrases between Russian and English.
This was the first demonstration of a working machine translation system in action — something previously theorized, but never actually realized. Unsurprisingly, given the technical limitations of the underlying hardware, it was decidedly rudimentary. The translator understood just six grammar rules, and had a small dictionary. But it worked.
IBM’s experiment was a well-publicised success, and governments began investing heavily in computational linguistics — and, thus, natural language processing.
For the remainder of the 20th century, computer scientists tried to solve the problem of NLP by treating language as the product of a rule-based system, where things can be represented by easily comprehended if-then-else logic. This approach — often described as “symbolic NLP” — isn’t without merit. Human language is, after all, something that’s inherently rule based.
For example, in English, adjectives usually precede the noun they relate to. In French, the opposite is usually (though not always) true. Take the phrase “the black car.” In English, the adjective (“black”) comes before the noun (“car”). In French, it’s “la voiture noire.” Here, the order is totally flipped, with black (“noire”) coming after the word for car (“voiture”).
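To make the symbolic approach concrete, here’s a deliberately tiny sketch in Python: a hard-coded two-word lexicon and a single hand-written word-order rule applied to the “black car” example above. Everything in it (the lexicon, the rule, the function name) is invented for illustration; real symbolic systems encoded thousands of such rules and far richer grammars.

```python
# A toy, hypothetical illustration of rule-based (symbolic) translation.
# The two-entry lexicon and single word-order rule are invented for
# illustration only; real systems used thousands of hand-written rules.

LEXICON = {
    "black": {"pos": "ADJ", "fr": "noire"},
    "car": {"pos": "NOUN", "fr": "voiture"},
}

def translate_noun_phrase(words: list[str]) -> str:
    """Translate an English noun phrase into French using explicit rules."""
    adjectives = [w for w in words if LEXICON[w]["pos"] == "ADJ"]
    nouns = [w for w in words if LEXICON[w]["pos"] == "NOUN"]
    # Rule: in French, most adjectives follow the noun they modify,
    # so emit the noun(s) first, then any adjectives.
    ordered = nouns + adjectives
    return "la " + " ".join(LEXICON[w]["fr"] for w in ordered)

print(translate_noun_phrase(["black", "car"]))  # -> "la voiture noire"
```

Even this toy version hints at the maintenance burden: every new word needs a lexicon entry, and every grammatical exception needs another rule.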
But when you treat language — and natural language processing — exclusively as a rule-based challenge, you encounter a few problems.
First, it’s time-consuming. Somebody has to manually define the countless rules of every language they want the system to understand. Those rules differ wildly between languages, and they don’t always apply: most languages, including English, have some form of irregularity.
Moreover, people don’t always speak or write using the “standard” (a term preferred by linguists instead of “correct”) form of their language. How often have you heard someone say “we was” instead of “we were?” Or used a non-standard contraction of a phrase, like “gotta” instead of “got to?” Language is fundamentally messy, and a rule-based approach struggles to capture all these variations.
Finally, an exclusively rule-based approach struggles to discern the meaning behind a sentence or an utterance. Language is more than just syntax and grammar. It’s a tool for communication.
Statistical and neural methods in NLP
To address these challenges, the computational linguistics field began using statistical methods to understand language in the late 1980s. These methods can be (and often are) used alongside symbolic methods.
Statistical NLP tries to understand the meaning behind language, rather than just how language is constructed. Words are identified, placed within a taxonomy of meaning, and given a weighting as to their relevance and the probability they’ll appear next to another word.
Take the verb “eat,” for example. Only a relatively small set of nouns is likely to follow it (“lunch” or “food,” say). The word “feeling” is often followed by “fine” or “sick.” And so, by understanding these relationships, we can build systems that can infer the meaning behind human language.
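Here is a minimal sketch of that statistical idea in Python: count word pairs (bigrams) in a tiny, invented corpus, then use the counts to estimate how likely one word is to follow another. Real systems work with vastly larger corpora and apply smoothing, but the principle is the same.

```python
from collections import Counter, defaultdict

# Tiny corpus, invented for illustration only.
corpus = [
    "i eat lunch", "i eat food", "i eat lunch early",
    "i am feeling fine", "i am feeling sick",
]

# Count how often each word follows each other word (bigram counts).
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for first, second in zip(words, words[1:]):
        bigram_counts[first][second] += 1

def next_word_probability(first: str, second: str) -> float:
    """Estimate P(second | first) from raw bigram counts."""
    total = sum(bigram_counts[first].values())
    return bigram_counts[first][second] / total if total else 0.0

print(next_word_probability("eat", "lunch"))     # 2/3, about 0.67
print(next_word_probability("feeling", "sick"))  # 1/2, exactly 0.5
```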
In recent years, these statistical models have become increasingly capable, in part due to advancements in AI and machine learning, but also due to the availability of large, open training datasets. The abundance of this data has similarly allowed a new discipline within natural language processing to emerge: Neural NLP.
Neural NLP uses neural networks to process, parse, and understand text. A neural network is a computerised representation inspired by how the brain processes information, and consists of layers of interconnected artificial “neurons.”
Each network consists of three elements. First, there’s the input layer, which receives data and pre-processes it for analysis. This data then travels to the “hidden layers.” A network may include several hidden layers. Each receives data, analyzes it, and then passes it to the next layer. Eventually, the output layer returns the final result of the data processing.
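To ground that layer-by-layer description, here is a minimal, untrained sketch in Python using NumPy. The layer sizes, random weights, and activation functions are arbitrary choices made purely for illustration; a real NLP network is vastly larger and learns its weights from training data.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Input layer: a 4-dimensional feature vector (values are arbitrary).
x = np.array([0.2, 0.7, 0.1, 0.5])

# Hidden layer: three artificial "neurons", each computing a weighted
# sum of its inputs followed by a non-linear activation.
W_hidden = rng.normal(size=(3, 4))
hidden = relu(W_hidden @ x)

# Output layer: a single value, e.g. the probability of some label.
W_out = rng.normal(size=(1, 3))
output = sigmoid(W_out @ hidden)

print(output)  # an untrained prediction between 0 and 1
```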
Neural networks have proven capable of inferring the meaning and context of words, but they require vast quantities of data to work effectively. The GPT-3 family of models that underpins ChatGPT, for example, was trained on hundreds of gigabytes of text filtered from Common Crawl — a nonprofit organisation that scrapes the web to create large datasets that can be used in AI/ML applications.
Natural Language Processing in Real Life
The field of NLP has advanced significantly since the first early experiments in the 1950s. Thanks to a combination of academic research, commercial investment, large (and free) datasets, faster computers, bigger neural networks, and better neural network architectures, NLP-based applications are now both ubiquitous and highly capable. We use them in our daily lives — often without giving them a second thought. Examples include:
- Generative AI chatbots like ChatGPT, which can understand written questions and produce custom pieces of text in response.
- Sentiment analysis tools, allowing marketers and pollsters to understand how the public perceives an issue, a product, or a politician at scale.
- Text completion tools, like the ones used on your phone, or within services like Gmail and Google Docs. By understanding the meanings of individual words, they can suggest the next word that should appear in a sentence, or how to finish a sentence.
- Spellchecking tools, which use NLP to highlight words that, although spelled correctly, seem out of place.
- Language translation services, like Google Translate.
- Voice assistant services, like Amazon’s Alexa, Google Assistant, Microsoft Cortana, or Apple’s Siri.
- Spam and phishing filtering tools. Looking at the composition of an email is one (although not the only) way to determine whether it’s genuine or malicious.
Thanks to the success of ChatGPT, more organisations are including NLP-based technologies in their products. Examples include Snapchat, Grammarly, DuckDuckGo, and Discord.
How NLP works
Modern statistical NLP methods try to infer meaning and intent from a text by looking at the words it contains, the frequency with which they appear, and their relationship to other parts of the sentence. With this information, a model can then make predictions about the underlying meaning and purpose of the content, or the next likely phrase or sentence.
The first stage of this process involves converting the text into a format suitable for computational analysis (a short code sketch follows the list below). This may include:
- Stemming: Here, words are reduced to their base form. These base stems aren’t necessarily real words. For example, stemming “produces,” “producing,” and “produced” yields “produc.”
- Lemmatisation: This is a more formal alternative to stemming, using a dictionary and the word’s morphology to identify its root form (the lemma).
- Sentence segmentation: Here, a larger body of text is broken into its component sentences. Most modern written languages make this relatively easy, with punctuation (like the period, question mark, and exclamation point) delineating sentences. Older written forms of some languages — including Japanese, Arabic, and Chinese — lacked such punctuation, which presents a challenge for researchers working with historical texts.
- Stop word removal: Here, a text pre-processor removes common stop words (like ‘a,’ ‘the,’ and ‘an’). These words often don’t add any additional meaning to the text, and by removing them, you reduce the time it takes to train the NLP model. Removing stop words isn’t always advisable, particularly in sentiment analysis tasks, where they can convey meaning. For example, removing the stop word “not” in the phrase “not good” reverses its meaning entirely.
- Tokenization: In this phase, the text is broken up further into its component word and word fragments, where they can be quantified, classified, and processed.
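As a rough illustration of these pre-processing steps, the sketch below uses the NLTK library (one popular option, assumed to be installed via pip install nltk) to segment, tokenize, filter, and stem an invented sample text. Other toolkits, such as spaCy, offer similar pipelines.

```python
# A minimal pre-processing sketch using NLTK; the sample text is invented.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

# One-time downloads of the tokenizer models and the stop-word list
# (newer NLTK releases may also require the "punkt_tab" resource).
nltk.download("punkt")
nltk.download("stopwords")

text = "The factory produces cars. It is producing a new model this year."

# Sentence segmentation, then tokenization.
sentences = sent_tokenize(text)
tokens = [word_tokenize(sentence) for sentence in sentences]

# Stop word removal and stemming.
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
processed = [
    [stemmer.stem(token) for token in sentence_tokens
     if token.isalpha() and token.lower() not in stop_words]
    for sentence_tokens in tokens
]

print(processed)
# Expected output along the lines of:
# [['factori', 'produc', 'car'], ['produc', 'new', 'model', 'year']]
```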
With the text converted, an AI developer or researcher is now able to feed it into a machine-learning algorithm or model. These include:
- Support vector machines (SVMs, originally described as support-vector networks): This category of algorithms is used for classification, regression (determining the relationship between elements), and outlier detection.
- Bayesian inference: These ML algorithms combine prior knowledge or assumptions with observed data to update their estimates. Naive Bayes classifiers, a common choice for text classification, are a simple example; see the sketch after this list.
- Hidden Markov models: These models are used in part-of-speech tagging, which allows a computer to assign each word a grammatical role and, using the surrounding words for context, predict what comes next.
- Cache language models: Used extensively in language-centric tasks, these estimate the probability of a word (or sequence of words) appearing in a phrase or sentence, giving extra weight to words that have appeared recently in the text.
- Artificial neural networks: This prominent subset of ML uses models that try to emulate the way brains operate, particularly the relationships between neurons. Artificial neural networks can “learn” patterns from example data without every rule being explicitly specified by a human operator. All recent powerful NLP models are powered by neural networks.
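As a concrete (if toy-sized) example of one of these approaches, the sketch below trains a Naive Bayes classifier, a simple application of Bayesian inference, to perform sentiment analysis using scikit-learn. The four training sentences are invented and far too few for real use; they are only meant to show the shape of the workflow.

```python
# A minimal, hypothetical sentiment classifier using scikit-learn
# (assumed installed via `pip install scikit-learn`).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny, invented training set: far too small for any real application.
train_texts = [
    "I love this product", "Absolutely fantastic service",
    "This was terrible", "I hate the new update",
]
train_labels = ["positive", "positive", "negative", "negative"]

# Convert text into word-count vectors, then fit the Naive Bayes model.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["I love the new service"]))   # likely ['positive']
print(model.predict(["This update is terrible"]))  # likely ['negative']
```

The same pipeline shape (a vectorizer feeding a classifier) works with other algorithms from this list, such as support vector machines.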
This list isn’t exhaustive. Given the rapid development of AI/ML, it’s inevitable that other approaches, models, algorithms, and technologies will emerge in the coming years, allowing researchers and developers to tackle new or difficult problems, or build NLP applications with greater accuracy or efficiency.
And yet, they’re worth noting, if only to illustrate the different approaches available to those in the AI/ML space, and the diverse ways in which they operate. The concept of NLP isn’t new; it has evolved over some 75 years. This diversity is a consequence of that gradual improvement, the growing ability of computers to handle computationally expensive tasks, and the need for different solutions to different problems.
Why NLP matters
Computers have always been great at crunching numbers. It’s this ability that landed a spaceship on the moon, keeps us safe online, and allows for things like video games. But people don’t speak in numbers.
Human language is messy. It’s complicated. But it’s also something that comes naturally to us. NLP is, in essence, a way to bridge the gap between people and machines. This, ultimately, will allow computers to become far more personal, and thus, more useful.
ChatGPT provides a vision of what this highly personal world of technology will look like, where computers understand the subtleties and nuances of written and spoken text, and can respond in ways we natively understand.
It’s highly likely that the way we use computers will meaningfully change in the coming decades, with menus and dialog boxes losing relevance to the spoken or written word. Faced with this change, it’s important to understand how these systems actually work, even if it’s on a relatively superficial level.
And that’s because NLP and the broader AI/ML push promise to change our relationship with technology. Computers will have to “think” more, and people will have to trust these systems to operate reliably and accurately. And the only way to trust something is to have an informed understanding of it.