Natural language processing (NLP) is the construction of automated ways of understanding human language, as distinct from numerals, images, sounds or other types of data we might process digitally.
Although in recent years many people have been exposed to the outputs of NLP in its spoken form, via Siri, Alexa and other “personal assistant” products, the technology has been applied to text for decades – and this is where Carl Hoffman’s expertise resides.
Hoffman is the Cambridge, Massachusetts-based co-founder and CEO of Basis Technology. “With regard to financial services, the greatest untapped potential lies in the written word,” he said.
Despite the vast amount of numeric data within big enterprises, there’s even more in the form of text: reports, email messages, filings, or legal documents. Whereas numeric data is often “structured” (clearly defined and easy to search), text data is not.
Banks and other institutions also have semi-structured data, such as databases (searchable) with large text fields (unstructured); for example, an IPO prospectus may contain plenty of tables about a company’s earnings, balance sheet and cashflow, but its most important content may be statements around risk, corporate governance or use of proceeds.
Origins of NLP
NLP is not new. What is new is the use of neural networks, but we’ll get to that later.
First, understand that NLP of text has long been around in the form of search, whether for a keyword on Google or a name in your email contact list. These functions are evolving: Google’s search engine now acts as a form of question answering, for example.
Financial institutions are using NLP in areas such as customer service, most of which today is in text form, like the “can I help you” box that pops up on many banks’ consumer websites. But they are also moving to use NLP for a more dynamic relationship with their customers, such as monitoring social media to understand what people are saying about them.
Parsing this is difficult, given slang and jargon, but in aggregate, it can provide powerful signals. With sufficient volumes, such data can be dissected to understand basic trends among different groups (e.g., males and females, or people in developed versus emerging markets).
In finance, there are plenty of sources of data: transcripts from earnings reports or speeches can fuel a sophisticated NLP system that gauges sentiment toward a corporation, or around an action (such as market reactions to a hike in interest rates, or a stock split).
“Technology gives us the ability to understand the objective meaning of words, in the context we care about,” Hoffman said. A person understands the gist of an earnings-call transcript, in a way that a computer cannot, but not whether it will impact the price of a stock – that has remained a subjective determination.
But with NLP, as it gets better at connecting words to the aspect of text that an analyst cares about, qualitative judgments can be made in a quantitative manner. “NLP has the potential to take things out of the world of guessing,” Hoffman said.
To do so, NLP has to first be able to read a text and understand it in context – to get its gist, as a human would.
Here’s how it works. Let’s say we want to compare two speeches by different central-bank governors – say the Fed’s Jerome Powell and Mark Carney of the Bank of England – to divine their views of interest rates and understand whether they differ or agree.
A simplistic natural-language process begins by counting the letters in each speech: each has so many As, Bs, etc., which go into a histogram that charts letter frequency. Ideally a computer needs lots of data, so it might want to do this for as many of Powell’s and Carney’s speeches as possible. Although this is a primitive form of machine learning, the result will hint that Powell and Carney prefer different words to describe a concept. But any signal would be weak.
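The letter-counting step described above can be sketched in a few lines of Python. This is a minimal illustration, not a production method: the function name `letter_histogram` and the two sample sentences are hypothetical, and the sentences are invented placeholders rather than real quotations from either governor.

```python
from collections import Counter
import string


def letter_histogram(text):
    """Return the relative frequency of each letter a-z in the text."""
    # Keep only the 26 unaccented letters; digits, spaces and punctuation are ignored.
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters)
    # Every letter gets an entry, so two histograms can be compared point by point.
    return {ch: (counts[ch] / total if total else 0.0)
            for ch in string.ascii_lowercase}


# Hypothetical snippets standing in for two speeches (assumption, not real quotes).
speech_a = "Rates will rise gradually as the economy expands."
speech_b = "We expect policy to remain patient and data dependent."

hist_a = letter_histogram(speech_a)
hist_b = letter_histogram(speech_b)

# Print the letters whose share of the text differs most between the two speakers.
gaps = sorted(string.ascii_lowercase,
              key=lambda ch: abs(hist_a[ch] - hist_b[ch]),
              reverse=True)
for ch in gaps[:5]:
    print(ch, round(hist_a[ch], 3), round(hist_b[ch], 3))
```

With only a sentence or two per speaker the differences are mostly noise, which is why the approach needs as many speeches as possible.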
How to improve this? Data scientists might next focus on pairs of adjacent letters, what linguists call a bigram. For example, what letters typically follow someone’s use of “the”? Is it words that begin with the letter A, or P? (P then often being followed by an H.)
For a bigram of letters, there are 26 squared, or 676, possible combinations. A computer can now create a histogram of 26² points for both Powell and Carney. Now we have a more precise sense of how the two differ in their choices of words. This can be refined yet further by introducing n-grams, or contiguous sequences of “n” items, which can be letters but also words or syllables, for example.
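As a rough sketch of the same idea in code, the two helpers below count n-grams over letters and over words. The names `char_ngrams` and `word_ngrams` are hypothetical, and the letter version makes a simplifying assumption: it strips spaces and punctuation first, so some bigrams span the boundary between two words.

```python
from collections import Counter

# Number of distinct two-letter combinations over a 26-letter alphabet.
POSSIBLE_LETTER_BIGRAMS = 26 ** 2  # 676


def char_ngrams(text, n):
    """Count contiguous runs of n letters (a-z only, case-insensitive)."""
    letters = "".join(ch for ch in text.lower() if "a" <= ch <= "z")
    return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))


def word_ngrams(text, n):
    """Count contiguous runs of n whitespace-separated words."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


# Hypothetical sentence, not a real quotation.
sample = "the path of policy depends on the path of inflation"

letter_bigrams = char_ngrams(sample, 2)
word_bigrams = word_ngrams(sample, 2)

print(letter_bigrams.most_common(3))
print(word_bigrams.most_common(3))
```

Setting `n` to 1 reproduces the single-letter histogram, and larger values of `n` capture progressively more of a speaker's phrasing, at the cost of needing more text to fill in the counts.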
This evolves all the way from using NLP to establish the frequency of letter sequences, to how different speakers position clauses or use prepositions. That enables us to start quantifying quirks of speech, so the computer understands what Powell might mean by citing a “home run” or what Carney means if he warns of a “yellow card”. In the process, the computer generates a more reliable signal to distinguish meaning between the two transcripts.
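The article does not say how two sets of counts are turned into a single signal. One common choice, used here purely as an assumption for illustration, is cosine similarity between the two count vectors: a value near 1 means the speakers use much the same vocabulary, a value near 0 means little overlap. The function names and sample sentences below are hypothetical.

```python
import math
from collections import Counter


def word_counts(text):
    """Count lowercase, whitespace-separated words."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors stored as Counters."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical snippets standing in for two transcripts (not real quotes).
transcript_a = "we expect gradual rate increases as growth remains solid"
transcript_b = "we expect rate increases to be limited and gradual"

similarity = cosine_similarity(word_counts(transcript_a),
                               word_counts(transcript_b))
print(round(similarity, 3))
```

The same function works unchanged on letter or word n-gram counts, since it only needs two Counters with comparable keys.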
This relies on machine learning – feeding the computer massive volumes of inputs, such as central banker speeches. This has been how artificial intelligence has progressed for decades, be it for text or training a computer to tell the difference between images of cats versus dogs.
In the old days, such learning was based on rules. For the past two decades it has relied on statistical machine learning (i.e., the computer figures it out on its own through trial and error, not by following rules). But NLP got a massive boost starting in 2011, when Google launched Project Marvin, introducing neural networks to train computers. Now all kinds of A.I.-based projects are based on this.
“Neural nets have radically transformed everything to do with NLP,” Hoffman said.
A neural network is a computing structure, likened to a black box, with inputs and outputs, designed to mimic biology rather than just rely on math. Nodes and connections create a network akin to the neurons in our brains. What varies among neural networks is the way these components are put together.
How does this work in practice? Go back to our basic example of building a histogram based on each speaker’s frequency of letters – Powell uses a lot more As than Carney, who favors Es and Ps.
In a neural network, there will be an input counter, or node, tallying the sequence of letters coming into the system; there will also be a “disambiguator” node, which in computer science means software to determine the intended meaning of a word or phrase.
Inside the black box
The single-letter analysis would provide one “layer”. Add a bigram, and you’ve got a second layer. A neural net with just a handful of layers is considered shallow. But one with many layers, each node connecting to others, creates a complexity that is meant to mirror the trillions of synapses in the human brain. These layers refine and alter the context of what goes into the network.
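To make the idea of stacked layers concrete, here is a minimal sketch of a two-layer feed-forward network in plain Python. It is an illustration under stated assumptions, not how any production system is built: the weights are arbitrary, untrained numbers, the three input features are hypothetical letter frequencies, and the sigmoid activation is simply one common choice.

```python
import math


def dense_layer(inputs, weights, biases):
    """One fully connected layer: a weighted sum per node, then a sigmoid."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        outputs.append(1.0 / (1.0 + math.exp(-z)))
    return outputs


def forward(inputs, layers):
    """Feed the inputs through each layer in turn and return the last outputs."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations


# Hypothetical, untrained weights: each inner list holds one node's weights.
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),  # 3 inputs -> 2 hidden nodes
    ([[1.0, -1.0]], [0.0]),                              # 2 hidden nodes -> 1 output
]

# Hypothetical input: the relative frequency of three letters in a speech.
features = [0.12, 0.08, 0.02]

score = forward(features, layers)[0]
print(score)
```

Training consists of adjusting those weights until the output matches known examples; a deep network is the same structure with many more layers and nodes, which is where the opacity and the computing cost come from.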
Businesspeople think of the network as a black box. It’s got inputs, and it’s designed in a way to interpret this layering – how exactly we don’t know – and then it spits out results. This approach is harder to analyze but its results tend to be more accurate with larger amounts of data. The practical result is that these networks need less human labor.
But deep neural networks also have new costs, including computing power: running a neural network on conventional CPUs (central processing units, the circuitry that carries out a computer’s programs) takes 6x to 30x longer than legacy versions of machine learning. Firms therefore need either a lot more computers, or to invest in GPUs (graphics processing units) that are designed for such intensive work.
The problem with black boxes, of course, is that it’s difficult to understand why the network gives a particular output. Hoffman says companies need to spend resources up front to engineer their neural networks for accountability, auditability and transparency – especially if there’s a risk that a machine’s output could land a company with a lawsuit.
“It’s easy to build unexplainable A.I.s,” Hoffman said, but this can lead to disasters such as chatbots spewing racist cant or autonomous cars crashing into pedestrians. The impact on a portfolio running on neural networks, for example, could be huge if the machine loses lots of client money.
“Explainability is high on the priority list of everybody working in A.I.,” Hoffman said, noting that many vendors are working on tools and architectures to see how the machine arrives at a particular answer. “There will be surprises, and we need to understand what’s going on.”