This is the sixteenth article in a series dedicated to the various aspects of machine learning (ML). Today’s article will introduce one of the most innovative and essential fields of AI: computer vision (CV). 

Like language, the ability to see is something many people take for granted and regard as an essential part of their life experience. A life without vision, or without language for that matter, may seem strange to those who have both, yet a good many people go through daily life without being able to see. Similarly, cavemen were able to live in societies without languages as complex as French, English, or Swahili, communicating instead with proto-languages. 

We noted in our previous articles on natural language processing that computers do not naturally understand human languages, but rather must be taught them through machine learning processes. Similarly, a computer does not naturally “see,” even if it is equipped with a camera. Just like speaking English, learning to perceive is something that must be taught to a computer. 

Think of Domino’s R2 robot, the hero of this machine learning series, and how it must analyze the surrounding stimuli it takes in during a typical delivery. From dogs to cars to crosswalk signs, R2 must perceive what is happening around it with faultless clarity in order to make the right decisions and not endanger itself or anyone or anything around it through misperception. 

Seeing is not the only sense a computer must learn. Hearing, touching, and sensing temperature are just a few of the other things a computer must be taught through machine learning methods. The field dedicated to teaching computers how to perceive visual input is called computer vision, and we’ll give you a rundown of its most important points below. 

AI See You: Computer Vision

The ultimate goal of computer vision is information extraction, which is concerned with figuring out what is important about any particular stimulus. The delivery robot R2 will be concerned with questions such as, “Is the running dog headed straight for me, or will we not collide?” or, “How much time is left for me to cross the street?” Computer vision will allow these questions to be answered. 

For information extraction, an agent needs to be able to create a model that represents what it is seeing. There are two general approaches to computer vision modeling: object modeling and rendering modeling. 

Object models can be detailed, like a geometrical representation of a space, or ambiguous, like an assertion that Siberian Huskies and Northwestern Wolves look alike at a low enough resolution. 

Rendering models build precise models from often ambiguous input, representing the world with clues from lighting, texture, shading, and other aspects of a scene. These models can run into the problem of ambiguity. For example, it may be hard to tell whether the animal running towards R2 is a Siberian Husky or a Northwestern Wolf because the shade of a pawn shop’s awning is covering the animal. However, any good agent will side with the interpretation that it is a Siberian Husky, because wolves don’t tend to run rampant on the sidewalk in populous town spaces. 
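One way to picture how an agent might settle the husky-or-wolf question is to combine what the image suggests with how common each animal is in town. The sketch below applies Bayes' rule to do that. The function name `posterior` and every number in it are hypothetical, chosen only for illustration and not taken from any real system.

```python
def posterior(priors, likelihoods):
    """Combine prior beliefs with image evidence using Bayes' rule."""
    unnormalized = {label: priors[label] * likelihoods[label] for label in priors}
    total = sum(unnormalized.values())
    return {label: value / total for label, value in unnormalized.items()}


# Assumed prior: wolves are very rare on a town sidewalk (illustrative value).
priors = {"husky": 0.999, "wolf": 0.001}

# Assumed evidence: the shaded image looks slightly more wolf-like (illustrative value).
likelihoods = {"husky": 0.4, "wolf": 0.6}

beliefs = posterior(priors, likelihoods)
best_guess = max(beliefs, key=beliefs.get)
print(best_guess, beliefs[best_guess])
```

Even though the shaded image leans slightly towards "wolf" in this made-up example, the strong prior keeps "husky" as the far more probable interpretation.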

This brings us to computer vision’s biggest concerns: reconstruction and recognition.

The former deals with building a model of what the agent observes in the world, such as representing a running animal seen through R2’s camera. Reconstructions can capture the precise details of an image, like the textures and boundaries of objects.
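To make the idea of capturing object boundaries a little more concrete, here is a minimal sketch that marks places where brightness jumps sharply between neighboring pixels in a tiny grayscale grid. The function `vertical_boundaries`, the threshold, and the pixel values are all assumptions made for illustration. Real reconstruction pipelines are far more sophisticated than this.

```python
def vertical_boundaries(image, threshold):
    """Mark pixels whose brightness differs sharply from the pixel to their left."""
    edges = []
    for row in image:
        edge_row = [0]  # the first column has no left neighbor to compare with
        for x in range(1, len(row)):
            jump = abs(row[x] - row[x - 1])
            edge_row.append(1 if jump >= threshold else 0)
        edges.append(edge_row)
    return edges


# Hypothetical 3x4 grayscale image: a bright object in the upper right corner.
image = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 10, 10],
]

edges = vertical_boundaries(image, threshold=50)
for row in edges:
    print(row)
```

This sketch only compares each pixel with its left neighbor, so it finds the left side of the bright object but not its bottom edge. A fuller version would also compare pixels vertically.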

The latter is concerned with differentiating among the various things, like dogs and wolves, that the agent encounters in the world. The process of recognition often involves neural networks, that staple of deep learning. Neural networks work to classify input images with a hopefully high degree of accuracy, producing output labels that correctly identify an animal as a dog or a wolf. Further, such recognition algorithms can be used as real-time object detectors, where classification algorithms differentiate between the objects in any given scene and label each one accordingly. 
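The final labeling step of recognition can be sketched in a few lines. The example below assumes a trained network has already produced raw scores for several regions of a camera frame, then converts those scores to probabilities with softmax and picks the most likely label for each region. The label list, region names, and scores are hypothetical. No actual network is trained or run here.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores, labels):
    """Return the most probable label and its probability."""
    probs = softmax(scores)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs[best]


LABELS = ["dog", "wolf", "car", "crosswalk sign"]

# Hypothetical raw scores, one list per region of a single frame.
# In a real detector these would come from a trained neural network.
regions = {
    "region_a": [4.0, 1.5, 0.2, 0.1],
    "region_b": [0.3, 0.1, 5.0, 0.4],
    "region_c": [0.2, 0.3, 0.5, 3.5],
}

detections = {name: classify(scores, LABELS) for name, scores in regions.items()}
for name, (label, prob) in detections.items():
    print(f"{name}: {label} ({prob:.2f})")
```

Each region gets its own label and a confidence value, which is roughly what an object detector reports for every object it finds in a scene.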

Summary

Computer vision is one of the fastest-growing fields of AI. As with language, perception tends to come pretty easily to able-bodied people, but it requires quite a bit of work for a computer to correctly perceive what is in an environment and the relationships between everything seen. Computer vision is the field dedicated to making AI agents adept at perception. The world is represented primarily through object models and rendering models. The goals of computer vision are accurate and reliable reconstruction and recognition, which deal respectively with the representation and classification of the objects in any space.