Reading, Writing, Arithmetic, Robotics: What to Know About Machine Learning in 2021 Part II
This is the second article in a series dedicated to the many aspects of machine learning. Today’s article will focus on a staple of any machine learning algorithm’s diet: Data. We’ll give you a rundown of Big Data, as well as briefly outline the distinction between the two major types of data encountered by ML algorithms: unstructured and structured data.
Data is one of the most valuable resources for a company. Without it, many business owners would have to rely on pure guesswork, rather than educated guesswork, to develop their marketing strategy. How many customers you can expect on a given day, which products or services are the most popular, and what your overall revenues are: these are just a few examples of the data you rely on to keep your business alive and well in an ever-competitive market.
It’s not just humans that need data to accomplish their tasks. In fact, without data, most AI agents would be totally useless. Machine learning as a concept is defined by the ability to make sense of new, unfamiliar input data. If your phone doesn’t hear you say “cheesy bread,” then the ML algorithm responsible for delivering you ad content won’t ensure that you see a Domino’s ad between the pictures of weddings and puppies on your Facebook wall.
The idea here is simple: with no input data, there’s nothing to learn from, much as a child with no textbook or teacher won’t learn how to multiply fractions, because no “input data” is being delivered to them.
“Data” is a general term, as you could probably guess from the fact that the spoken words “cheesy bread” and the instructions for multiplying fractions both fall under the umbrella category of data. In a way, the variety of AI agents and ML algorithms is explained by the variety of data out there: an ML algorithm that analyses the protean price fluctuations of bronze will have a different end goal than one that analyses your customers’ Facebook posts.
The first “type” of data that you should know is the unimaginatively named big data, which refers to sets of data that are far too large and complex to be treated by a traditional data-processing application, let alone a human being.
There are four characteristics of big data that an ML algorithm should be able to handle. The first three are volume, velocity, and variety: the algorithm must process and move large volumes of data at high velocity, and it must cope with an ever-expanding variety of data sources (e.g., multiple voices in a room saying multiple things besides “cheesy bread”). The fourth characteristic, veracity, or the truthfulness of the data, depends more on who is supplying the input than on the algorithm.
There is a golden rule in data processing: if the data source is untrustworthy, then the data is untrustworthy. Data must be accurate and meaningful.
Unstructured vs. Structured Data
The distinction between structured and unstructured data is one of the most important in data analysis.
Structured data has a predefined length and format. It is stored in traditional relational databases. An example of this would be weather data, as sensors can be deployed to collect data on temperature, wind, barometric pressure, and precipitation.
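To make “predefined length and format” a little more concrete, here is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in for a relational database. The table name, column names, units, and sample readings are all made up for illustration; a real weather system would define its own schema.

```python
import sqlite3


def build_weather_db(readings):
    """Create an in-memory relational table with a fixed schema and load readings.

    The schema is hypothetical: every row must supply the same five typed fields.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE weather (
            station          TEXT NOT NULL,
            temperature_c    REAL NOT NULL,
            wind_kph         REAL NOT NULL,
            pressure_hpa     REAL NOT NULL,
            precipitation_mm REAL NOT NULL
        )
        """
    )
    conn.executemany("INSERT INTO weather VALUES (?, ?, ?, ?, ?)", readings)
    conn.commit()
    return conn


def average_temperature(conn, station):
    """Return the mean temperature for one station, or None if it has no rows."""
    row = conn.execute(
        "SELECT AVG(temperature_c) FROM weather WHERE station = ?", (station,)
    ).fetchone()
    return row[0]


# Made-up sensor readings: (station, temperature, wind, pressure, precipitation).
SAMPLE_READINGS = [
    ("north", 10.0, 12.0, 1012.0, 0.0),
    ("north", 14.0, 20.0, 1008.0, 2.5),
    ("south", 21.0, 5.0, 1015.0, 0.0),
]

conn = build_weather_db(SAMPLE_READINGS)
print(average_temperature(conn, "north"))
```

Because every reading fits the same columns, a question like “what is the average temperature at this station?” becomes a one-line query. That predictability is what makes structured data comparatively easy for software to work with.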
We mentioned above that data must be accurate and meaningful for it to be useful to an ML algorithm. This is generally true, and structured data indeed has meaning. However, in the case of unstructured data, the data may be meaningful, but the meaning is unknown to the AI agent.
Unstructured data sources are underused in the data processing world, but they are important nonetheless. While structured data has a defined format, unstructured data lacks one, so its meaning is not spelled out in advance.
Examples of unstructured data include internal company text, such as surveys and emails. Another example would be photos. If you were to assign an ML algorithm the task of categorizing animals based only on their photos, then that algorithm would be dealing with an unstructured data set. It doesn’t know beforehand what characteristics belong to a lion (or what a “lion” is, even), and will group the photos based on observed similarities. This practice of grouping data together based on observed similarities is called “clustering,” and is a common method in dealing with unstructured data.
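For readers who want to see what clustering looks like in practice, below is a rough sketch of k-means, one common clustering technique (the article does not say which method any particular system uses). It assumes each photo has already been reduced to two numbers, which is a big simplification: real image pipelines extract far richer features. The sample points and the function name kmeans are invented for this example.

```python
import math


def kmeans(points, k, iterations=10):
    """Group points into k clusters by repeatedly assigning and averaging.

    Uses the first k points as starting centroids, a simple deterministic
    choice for illustration (real implementations usually pick them randomly).
    """
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Step 1: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 2: move each centroid to the average of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )
    return assignments, centroids


# Hypothetical two-number summaries of six photos (no labels attached).
PHOTO_FEATURES = [
    (1.0, 1.0), (8.0, 8.0), (1.2, 0.8),
    (7.8, 8.2), (0.9, 1.1), (8.1, 7.9),
]

assignments, centroids = kmeans(PHOTO_FEATURES, k=2)
print(assignments)
```

Notice that the algorithm is never told what the two groups mean. It only reports which photos look alike, and it is up to a person (or a later step) to decide that one group is “lions” and the other is something else.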
Data and AI Summary
Data is as valuable as gold for a company and as essential as breathing for an ML algorithm. Most algorithms deal with big data, or large sets of complex data that a human can’t plausibly handle. Structured data comes with a predefined meaning and is easier to read for an AI agent, while unstructured data requires an agent to work in the dark and use more intuitive reasoning practices, like clustering, to accomplish its goal.