February 25, 2020

Structured, unstructured, and everything in between

In this post we look at what lies between structured and unstructured data, and at the key areas to consider.

Structured content is your SQL-bound data sets that live in organized systems like ERP. It’s easily extracted, organized, and ripe for analytics.

Unstructured content, on the other hand, refers to things like word processing docs, well logs, contracts, submissions, and the like. These assets do not fit naturally into a database, and they require some degree of data processing before you can find, filter, and focus on the relevant information they contain.

But is it really black and white? Is content simply structured or unstructured, or are there different shades in between? Let's consider the following.

Highly unstructured content

At the extreme end of the unstructured side, we see organizations grappling with highly unstructured content: social media posts, random paragraphs, and even the body text of email. Here the very idea of structure is absent, which makes analysis that much harder, but not impossible. It turns out that within the chaos, you can recover some amount of order by using text analytics and natural language processing technologies, applying noun-verb breakdowns, sentence order, sentiment indicators, word frequency and prediction, and other techniques to gain insights. Suddenly patterns emerge, and structure seems to appear. The challenge, though, is that the deeper you go down this rabbit hole, the less objective the results become. “Cool” can mean a Canadian winter, an off-putting temperament, or Fonz-like awesomeness. While there are a number of evolving technologies in this space, they still require significant system training, and their results become less reliable the more interpretation they attempt.
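As a rough illustration of the word-frequency technique mentioned above, here is a minimal Python sketch. The sample posts and the stop-word list are invented for this example, and the sketch uses only the standard library rather than any particular text analytics product.

```python
import re
from collections import Counter

# Hypothetical sample posts, invented for illustration only.
POSTS = [
    "Cool winter in Toronto, very cool.",
    "The new release is cool!",
    "He was a bit cool toward the idea.",
]

# A tiny illustrative stop-word list (an assumption, not a standard one).
STOP_WORDS = {"a", "the", "in", "is", "was", "he", "very", "bit", "toward", "new"}


def word_frequencies(texts):
    """Count lowercase word occurrences across texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts


freqs = word_frequencies(POSTS)
print(freqs.most_common(3))
```

The count shows that "cool" dominates these posts, but it cannot say which of the three meanings each post intends. That gap is exactly where the more subjective techniques take over.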

Semi-structured data

Somewhere in the middle, we might think of semi-structured data, the archetypal example being forms. These start off looking like fairly structured data: 10 defined fields, database integration, no big deal. The problem, as we found out during a recent proof of concept (POC) with an insurance customer, is that those 10 fields are never quite where you think they should be! Forms multiply across language, version, region, policy, format, paper size, and so on. All of a sudden “Name” in the upper right becomes “Nom” in the lower left, and then two fields of “nombre de pila / apellido” somewhere in the middle. Understanding this kind of unstructured structure is where file analysis and extraction technologies come into play.
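To make the form problem concrete, the sketch below maps differently labeled fields onto one canonical record. It assumes the labels and values have already been extracted from the page as text pairs. The alias table and the `normalize_fields` helper are hypothetical names for this example and do not describe any specific extraction product.

```python
# Hypothetical alias table: form labels (lowercased) -> canonical field names.
FIELD_ALIASES = {
    "name": "name",
    "nom": "name",
    "nombre de pila": "first_name",
    "apellido": "last_name",
}


def normalize_fields(extracted):
    """Map (label, value) pairs to canonical fields.

    Returns the normalized record plus any labels that had no known alias,
    so unrecognized form variants can be reviewed rather than silently lost.
    """
    record = {}
    unmatched = []
    for label, value in extracted:
        # Trim whitespace and a trailing colon before looking up the label.
        key = label.strip().rstrip(":").strip().lower()
        canonical = FIELD_ALIASES.get(key)
        if canonical is None:
            unmatched.append(label)
        else:
            record[canonical] = value
    return record, unmatched


english = normalize_fields([("Name:", "Ada Lovelace")])
spanish = normalize_fields(
    [("Nombre de pila", "Ada"), ("Apellido", "Lovelace"), ("Firma", "x")]
)
print(english)
print(spanish)
```

A lookup table like this only handles the labels. Finding where on the page each label and value sits is the harder part, and the sketch deliberately leaves it out.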

Data vs. content

At the other extreme, in the weeds of structured content, is the notion of structured and unstructured data. As if we weren’t confused enough with structured/unstructured content, the data we extract, whether it comes from a structured database or some unstructured document, can itself have a range of structure. A good example of this is a date stamp in Microsoft Excel. An Excel sheet contains structured content, but the data within it may be poorly structured. Consider dates written as Dec 31, 12/31, and 31/12. Where is the structure? Has it been applied properly? Can the next system interpret it correctly? Rapidly growing data preparation and data validation technologies help address this challenge, along with good policy and enforcement.
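The date problem is easy to demonstrate. The following sketch checks whether a numeric date string can be read as more than one calendar date. The two candidate formats and the sample values (including the years, which the examples above omit) are assumptions chosen for illustration.

```python
from datetime import datetime

# Candidate interpretations to test (assumed for this example).
FORMATS = {"month_first": "%m/%d/%Y", "day_first": "%d/%m/%Y"}


def interpretations(text):
    """Return every calendar date the text could mean under FORMATS."""
    results = {}
    for name, fmt in FORMATS.items():
        try:
            results[name] = datetime.strptime(text, fmt).date()
        except ValueError:
            # The text is not a valid date under this format; skip it.
            pass
    return results


def is_ambiguous(text):
    """True when the text maps to more than one distinct calendar date."""
    return len(set(interpretations(text).values())) > 1


for sample in ["12/31/2019", "31/12/2019", "03/04/2020"]:
    print(sample, interpretations(sample), is_ambiguous(sample))
```

Values such as 12/31 and 31/12 can only be read one way because no month is numbered 31, so the real danger is a value like 03/04, which parses cleanly under both conventions and gives the next system no hint that anything is wrong.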

So where does that leave you? Regardless of where you sit in the organizational structure, there’s lots of opportunity if you can navigate this convoluted world of information. It can help to look for opportunities that deliver solid wins for the business but require only modest investment and simple installations.

If you’re a certified database architect, then fine, go chase data structures. Similarly, if you hold a doctorate in mathematics and have a penchant for linguistics, then perhaps dig into the semantic side of things. But for the rest of us, somewhere around 80% of organizational information is unstructured content that is constantly ignored, underused, and not leveraged to its full potential. Technologies like Adlib's document and data transformation platform can help organizations like yours take advantage of that low-hanging and potentially high-value fruit.
