This is the fourteenth article in a series dedicated to the various aspects of machine learning (ML). Today’s article will continue the discussion of natural language processing (NLP) by covering a few of the essential components that go into the development of an NLP agent like Alexa.

Many Americans are required to take a second language at some point in their education. A native English speaker may not realize how bizarre, frustrating, and downright weird languages can be until they are made to take one of these classes. A few of the frustrations: daunting and boundless vocabularies; standard grammar rules with plenty of formal, informal, and region-based variations; and countless context-dependent connotations that native speakers must clue learners in on.

Foreign language classes teach us that a rose is indeed a rose by any other name, be it “rosa,” “rozi,” “ruusu,” “mawar,” “waridi,” or “atirgul,” among other denominations. These are just a few of the ways people have tried to capture this particular plant in a word, and seeing an extended list could make you shout, in frustration, Gertrude Stein’s line that “a rose is a rose is a rose is a rose.”

Learning an entirely new foreign language can be difficult, but it is easier when analogues for most foreign vocabulary words can be found in the student’s native language. An English speaker already knows the word “rose” and what it represents, and will simply carry that representation over when learning the Italian word “rosa.” For most people, a “rose” brings to mind the image of a red flower with soft petals and a thin but sturdy green stem. Even a blind speaker understands what colors mean to the sighted and keeps that same definition, although their personal association with the plant may be tied more to the feel of the petals. No matter who you are, your native language provides a touchstone for learning another.

Teaching Computers the Basics

We mentioned in the last article that computers generally operate with more primitive languages like first-order logic, rather than natural/human languages like English, which makes teaching a computer a bit more involved than teaching a human. Still, the basic things that must be learned are the same. There are two core things that anyone, human or computer, must understand about a language in order to know it: semantics (the meaning[s] behind a word) and syntax (the appropriate arrangement of words in a sentence or phrase).


You can teach a computer a vocabulary and a set of grammatical rules fairly easily, but putting them to effective use is a harder task, because of the context-dependent issues enumerated earlier in the article. Keep reading for a breakdown of a few of the more common types of learning models used in the NLP training process.

Learning Models for Word Representation

Making a computer understand the significance of the word “rose” within the context of a sentence of a natural language, as well as the sense or nonsense made by a given sentence, is the task of a language model. A language model is a probability distribution over word sequences that helps computers predict the likelihood of certain sentences, such as “A rose is a rose by any other name” versus “A is name rose rose by a other any.”


For the rose example above, the model would say that the former sentence is far more likely to be encountered in human text. A learning model teaches the agent how humans represent ideas through certain language structures. The agent learns that the phrase “red rose” is more likely to appear in conversation than “rose red,” simply because the former is used more often. Put simply, these learning models tell agents which phrases and words they are most likely to see.
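The idea can be sketched with a tiny bigram model: count adjacent word pairs in a corpus and score a phrase by how often its pairs occur. This is a minimal illustration, not a production language model; the four-sentence corpus below is invented for the example.

```python
from collections import Counter

# A toy corpus; a real language model is trained on millions of sentences.
corpus = [
    "a red rose is a rose",
    "the red rose smells sweet",
    "a rose by any other name",
    "the rose is red",
]

# Count unigrams (single words) and bigrams (adjacent word pairs).
unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def bigram_probability(phrase):
    """Approximate likelihood as the product of P(word | previous word)."""
    words = phrase.split()
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        # Add-one smoothing so unseen pairs get a small, nonzero probability.
        prob *= (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(unigrams))
    return prob

# "red rose" appears in the corpus, "rose red" never does, so the
# model scores the first phrase as more likely.
print(bigram_probability("red rose") > bigram_probability("rose red"))  # True
```

Scaled up to a large corpus (and to longer histories than a single previous word), this is exactly how an agent learns that “red rose” is the likelier phrase.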

Learning Models for Grammar

Learning models for word representation may teach agents which words and phrase structures are most likely to occur within a given language, but they don’t necessarily tell the agent why words are strung together in particular ways (“I met you”) and not others (“Met you I”). Grammar is the ruleset an ML agent must use in order to be successful at NLP, since a language is largely defined by the grammatical rules that inform the structuring of its words.

One learning model used for teaching computers grammar is probabilistic context-free grammar (PCFG). A PCFG attaches probabilities to grammar rules, such as the ways a noun phrase or verb phrase can expand, and those rules apply regardless of surrounding context (hence “context-free”). This gives the agent a taste for the proper stringing together of words by showing it the likelihood of certain structures (such as an adjective followed by a noun followed by a verb, e.g. “Bad dogs bite”), rather than just the likelihood of certain words (“bad,” “dogs,” “bite”) occurring in a certain order.
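A minimal hand-rolled sketch of the idea: each rule states the probability that a symbol expands a certain way, and the probability of a whole parse is the product of the rule probabilities it uses. The grammar and its rule probabilities below are invented purely for illustration; in practice they would be estimated from a corpus of parsed sentences.

```python
# A hypothetical PCFG: for each left-hand symbol, a list of
# (expansion, probability) pairs whose probabilities sum to 1.
pcfg = {
    "S":    [(["NP", "VP"], 1.0)],
    "NP":   [(["Adj", "Noun"], 0.4), (["Noun"], 0.6)],
    "VP":   [(["Verb"], 0.7), (["Verb", "NP"], 0.3)],
    "Adj":  [(["bad"], 1.0)],
    "Noun": [(["dogs"], 1.0)],
    "Verb": [(["bite"], 1.0)],
}

def tree_probability(tree):
    """Probability of a parse tree: the product of the rule probabilities used.

    A tree is (symbol, children), where each child is another tree or a word.
    """
    symbol, children = tree
    child_symbols = [c if isinstance(c, str) else c[0] for c in children]
    for expansion, p in pcfg[symbol]:
        if expansion == child_symbols:
            for child in children:
                if not isinstance(child, str):
                    p *= tree_probability(child)
            return p
    return 0.0  # no rule produces this expansion

# "Bad dogs bite" parsed as S -> NP VP, NP -> Adj Noun, VP -> Verb:
tree = ("S", [("NP", [("Adj", ["bad"]), ("Noun", ["dogs"])]),
              ("VP", [("Verb", ["bite"])])])
print(round(tree_probability(tree), 2))  # 0.28  (= 1.0 * 0.4 * 0.7)
```

The structure “adjective, noun, verb” gets a real probability here, while a parse the grammar cannot produce scores zero, which is the sense in which a PCFG ranks structures rather than mere word orders.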


Parsers

Parsing helps agents further understand the phrase structures behind sentences. By observing the occurrences of phrase structures in real sentences, the agent gets better at predicting on its own which phrases are “correct” and which shouldn’t be used.

This method is similar to how a child, on their own, figures out how to speak a language: they observe what is being used, pick up on the subtleties of sentence structure, and internalize them.
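The mechanics can be sketched with a tiny recursive-descent recognizer: given a hand-written grammar, it checks whether a word sequence can be built from the known phrase structures. The grammar, word lists, and function names here are invented for illustration, not any particular parsing library.

```python
# A hypothetical toy grammar: keys are phrase symbols, values list
# their possible expansions into sub-phrases or literal words.
grammar = {
    "S":    [["NP", "VP"]],
    "NP":   [["Pron"]],
    "VP":   [["Verb", "NP"]],
    "Pron": [["i"], ["you"]],
    "Verb": [["met"]],
}

def expand(symbol, words, pos):
    """Yield every position reachable after matching `symbol` from `pos`."""
    for production in grammar.get(symbol, []):
        positions = [pos]
        for part in production:
            next_positions = []
            for p in positions:
                if part in grammar:  # nonterminal: recurse into sub-phrase
                    next_positions.extend(expand(part, words, p))
                elif p < len(words) and words[p] == part:  # terminal: match word
                    next_positions.append(p + 1)
            positions = next_positions
        yield from positions

def is_grammatical(sentence):
    """A sentence is grammatical if S can consume every word."""
    words = sentence.lower().split()
    return len(words) in expand("S", words, 0)

print(is_grammatical("I met you"))  # True
print(is_grammatical("Met you I"))  # False
```

A learning agent goes one step further than this fixed recognizer: by parsing many observed sentences, it accumulates evidence for which structures occur, much like the child described above.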


Machine Learning Summary

It is the complicated, context-dependent parts of language that make learning a language difficult for computers. Learning models help computers learn the semantics and syntax of a language in different ways. Word-representation models tell agents which phrases and words they are most likely to see, grammar models help agents learn which phrase structures generally apply across sentences, and parsing lets agents observe for themselves how words and sentence structures are used, internalizing what is learned much like a child does when listening to humans speak.