Connecting the Far-Flung Data Dots

How AI and Natural Language Processing Make Sense of Unstructured Data

Everybody wants data these days. And there is no shortage of it. The digital world tracks our every move. Social media apps provide analytics for every type of engagement. People are producing written, visual, and audio content at an unprecedented rate.

As the amount of data increases, so do the possibilities for meaningful analysis and insight. Yet collecting, combining, and cleaning massive amounts of key research data, especially from various sources, is time-consuming.

This raises the question: Is it possible to replace or augment tedious manual processes with an automated, machine-driven process while retaining, or even improving, data quality and accuracy?

We think so.

At Arboretica, we’re putting our advanced natural language processing (NLP) techniques to work connecting key information from a large number of unstructured sources, and we are seeing success in both accuracy and the level of automation.

How are we doing it and who is reaping the benefits? Let’s break it down.

First, some definitions.

What is unstructured data?

Let’s start with structured data: it is data organized in a consistent pattern, or “structure,” that makes it easy to search and analyze. For example: a table of the daily average temperatures for a city.
Unstructured data, on the other hand, is everything else – think images, voice, video, and text. Typical examples of unstructured data include text files, social media data, mobile communications, and media files.

What is natural language processing (NLP)?

Natural language processing (NLP) is a field of artificial intelligence in which computers analyze, understand, and derive meaning from human language in a smart and useful way. It’s one of the more challenging areas of computer science because human language is rarely precise or plainly spoken. To understand human language is to understand not only the words, but the concepts and how they’re linked together to create meaning. Although language is one of the easiest things for the human mind to learn, its ambiguity makes natural language processing a difficult problem for computers to master.

How do we use NLP to make sense of unstructured data?

Put in the very simplest of terms, our technology scans large volumes of disparate source material to extract the data most relevant to a particular topic. Of course, when we peel back the layers, our technology, and the problems it solves, are significantly more complex.

Here’s an example.

In a recent engagement, a client needed to extract numbers and geolocations from over 500 scientific papers and map the data to predefined variables. The required data was in different formats (e.g., numerical and alphabetical), not to mention in different units (e.g., kilometers and miles). The challenge was multifold: the sheer volume of information, the different types of source material, and the different types of data, all of which needed to be matched to a common denominator for analysis.
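To make the "common denominator" idea concrete, here is a minimal Python sketch of normalizing distances that appear as digits or as words, in kilometers or in miles. It is an illustration only, not our production pipeline: the function name distances_in_km, the tiny number-word lookup, and the sample sentence are all assumptions made for this example.

```python
import re

# Conversion factors to the common unit (kilometers).
TO_KM = {
    "km": 1.0, "kilometer": 1.0, "kilometers": 1.0,
    "mi": 1.609344, "mile": 1.609344, "miles": 1.609344,
}

# Small, illustrative lookup for numbers written out as words.
WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

# A number (digits or a known number word) followed by a known unit.
# Longer unit names are tried first so "kilometers" wins over "km".
UNITS = "|".join(sorted(TO_KM, key=len, reverse=True))
PATTERN = re.compile(
    r"\b(\d+(?:\.\d+)?|" + "|".join(WORDS) + r")\s+(" + UNITS + r")\b",
    re.IGNORECASE,
)

def distances_in_km(text):
    """Return every distance found in text, converted to kilometers."""
    results = []
    for value, unit in PATTERN.findall(text):
        number = WORDS.get(value.lower())
        if number is None:
            number = float(value)
        results.append(round(number * TO_KM[unit.lower()], 3))
    return results

sample = "The site lies 12.5 km from the coast and five miles from the river."
print(distances_in_km(sample))
```

A real project would need far broader coverage (larger number words, abbreviations, ranges, other unit families), which is where trained models start to earn their keep over hand-written patterns.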

Arboretica’s advanced natural language processing technologies were up to the challenge: we scanned, extracted, and matched the unstructured data from more than 500 papers in just two weeks.

Here’s how we do it.

First, we identify the universe of source material in which we are looking for our target information: should we look for it in journal articles, social media posts, videos, or something else entirely?

Next, we scan the source material for its relevance to the target search terms. We use a highly automated hierarchical process in this evaluation: we first scan the metadata to determine relevance, then scan the main text, and then run an automated web search to find other related material.
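The hierarchical scan described above can be sketched in a few lines of Python. This is a simplified illustration rather than our actual implementation: the document fields (title, keywords, text) and the function name relevance_stage are hypothetical, matching is plain substring search, and the web-search stage is left out entirely.

```python
def relevance_stage(doc, terms):
    """Return the first stage at which any search term matches, or None.

    doc is assumed to be a dict with optional "title", "keywords",
    and "text" entries; terms is a list of search strings.
    """
    terms = [t.lower() for t in terms]

    # Stage 1: cheap check against the metadata only.
    metadata = " ".join(
        [doc.get("title", "")] + list(doc.get("keywords", []))
    ).lower()
    if any(t in metadata for t in terms):
        return "metadata"

    # Stage 2: fall back to the (much longer) main text.
    if any(t in doc.get("text", "").lower() for t in terms):
        return "text"

    # No match: the document is treated as not relevant.
    return None

paper = {"title": "Soil chemistry of alpine valleys",
         "keywords": ["soil", "pH"],
         "text": "We also measured glacier retreat over ten years."}
print(relevance_stage(paper, ["glacier"]))
```

Checking metadata first keeps the expensive full-text pass for documents that the cheap pass cannot decide.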

Once we have the data, our technology adds contextual information such as the data’s source, date, geolocation, and more.
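One simple way to picture this step is a record that carries each extracted value together with its context. The Python sketch below is only an assumption about how such a record might look: the class name ExtractedValue, its fields, and the sample values are invented for illustration.

```python
from dataclasses import dataclass, asdict
from datetime import date
from typing import Optional, Tuple

@dataclass
class ExtractedValue:
    variable: str                     # predefined variable the value maps to
    value: float
    unit: str
    source: str                       # where the value was found
    published: Optional[date] = None  # date of the source, if known
    location: Optional[Tuple[float, float]] = None  # (latitude, longitude)

# Hypothetical example record; none of these values come from a real project.
record = ExtractedValue(
    variable="glacier_length",
    value=12.5,
    unit="km",
    source="paper_017.pdf",
    published=date(2019, 6, 1),
    location=(46.5, 8.0),
)
print(asdict(record))
```

Keeping the source and date alongside every value is what lets a human reviewer trace a number back to where it came from.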

Then, when the volume of data demands it, we train machine-learning algorithms to match data to the common denominators for a specific variable.

And finally, our algorithms identify and extract all table and text data related to a target keyword. Our clients receive all of the data they need in a spreadsheet.
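As a rough picture of that last step, the Python sketch below pulls every sentence that mentions a keyword and writes the result as CSV, which any spreadsheet program can open. It is a deliberately naive illustration under stated assumptions: the function names keyword_rows and to_csv are hypothetical, sentences are split on punctuation only, and table extraction is not covered.

```python
import csv
import io
import re

def keyword_rows(documents, keyword):
    """Collect one row per sentence that mentions the keyword.

    documents is assumed to map a source name to its plain text.
    """
    rows = []
    for source, text in documents.items():
        # Naive sentence split: ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if keyword.lower() in sentence.lower():
                rows.append({"source": source,
                             "keyword": keyword,
                             "sentence": sentence.strip()})
    return rows

def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["source", "keyword", "sentence"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

docs = {"a.txt": "Rainfall was 30 mm. The soil was dry.",
        "b.txt": "No data."}
print(to_csv(keyword_rows(docs, "rainfall")))
```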

Along the way we iterate with the client to ensure that our technology is delivering not only accurate results, but value. Are we actually saving our clients time? Are we surfacing the right insights, or insights that otherwise wouldn’t be as easily identifiable? Are we making their analysis easier? With many clients, any time saved in data review and extraction is considered a win given their current manual processes.

Who needs to extract and connect unstructured data?

Anybody. Any organization that is seeking to derive value from its data can benefit from using machine-learning intelligence to augment traditional manual efforts. At Arboretica, we have worked with companies in industries as diverse as watchmaking, train manufacturing, and environmental policy writing.

The results (and the caveats)

There is no expectation that our technology can – or should – completely replace traditional data review, extraction, or analysis. Human review is almost always a requisite of any machine-learning process to ensure that algorithms are surfacing the right insights from the right material.

But our solutions can save you significant time, freeing your resources to focus on other data initiatives, and can improve the accuracy of the information being used in your analysis.

If you think your organization can benefit from the power of Arboretica’s AI solutions, contact us today.